Refactor Mandelbrot benchmark #152


Merged: 3 commits, Sep 11, 2018

Conversation

@GabrielMajeri (Contributor) commented Sep 9, 2018

This PR refactors the whole Mandelbrot benchmark:

  • only three algorithms are supported now, scalar / simd / ispc, and all of them are parallelized
  • argument parsing for the binary is done with structopt
  • the scalar implementation is rewritten with a struct + iterator, which helps auto-vectorization
  • some optimizations (e.g. precalculating the initial x values for the scalar and SIMD implementations)
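The refactored scalar kernel itself isn't shown in this thread; as a rough sketch, the escape-time loop that the struct + iterator rewrite wraps looks something like the following (the function name and iteration limit are my own, not taken from the PR):

```rust
/// Hypothetical sketch of the scalar escape-time kernel the PR's
/// struct + iterator refactor is built around.
fn escape_count(cx: f64, cy: f64, limit: u32) -> u32 {
    let (mut x, mut y) = (0.0_f64, 0.0_f64);
    for i in 0..limit {
        // The point has diverged once |z|^2 > 4.0, i.e. |z| > 2.
        if x * x + y * y > 4.0 {
            return i;
        }
        // z = z^2 + c, expanded into real and imaginary parts.
        let new_x = x * x - y * y + cx;
        y = 2.0 * x * y + cy;
        x = new_x;
    }
    limit
}

fn main() {
    // c = 0 never escapes; c = 2 + 2i escapes almost immediately.
    assert_eq!(escape_count(0.0, 0.0, 100), 100);
    assert!(escape_count(2.0, 2.0, 100) < 3);
}
```

A simple body like this, with no early data-dependent memory accesses, is the kind of loop the compiler has a chance of auto-vectorizing.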

Benchmark results for an image of size 16000x16000:

scalar: 5.4s
ispc: 3.7s
simd: 1.8s
fastest C++ implementation: 1.1s

Unfortunately, we're not gonna get much closer to the C++ implementation, since some of those optimizations would take away our ability to create color images.

@gnzlbg (Contributor) commented Sep 9, 2018

and they are all parallelized.

Why were the serial implementations removed? I think it is important for the examples to show the Cartesian product of (scalar, simd) x (serial, parallel) implementations, so that one can see the differences, for example, when going from scalar-serial to scalar-parallel vs. from simd-serial to simd-parallel.

Unfortunately, we're not gonna get much closer to the C++ implementation, since some of those optimizations would take away our ability to create color images.

I think for this it would make more sense to have a different parallel simd algorithm that just does not reuse the current ppm/pbm system but does it all in one go. That can be added in a different PR though.

fastest C++ implementation: 1.1s

Maybe we could add the C++ implementation to the benchmark, similarly to how we do it for the ispc kernels.

@GabrielMajeri (Contributor, Author)

@gnzlbg

Why were the serial implementations removed?

For consistency and for code reuse. ISPC never had a single-threaded version (the name was misleading), and simd and simd_par were extremely similar, basically copy&paste with chunks changed to par_chunks.

I don't know if there is any way to use generics to avoid this duplication.

I think for this it would make more sense to have a different parallel simd algorithm that just does not reuse the result

I guess, but that would mean even more copy&paste. I also tried to find a solution for this using generics / traits, but the issue is that for the B/W version we don't even have to store the count. It's conceptually completely different from what we're doing now.

@GabrielMajeri GabrielMajeri force-pushed the refactor-mandelbrot branch 2 times, most recently from 3f25184 to a4bffe1 Compare September 9, 2018 14:27
@GabrielMajeri (Contributor, Author)

@gnzlbg

  • Instead of adding a separate single-threaded implementation, would it be OK to add a command-line switch that limits rayon's thread count? Something like:
./mandelbrot 800 600 --threads 1 --algo vector
  • I'd also like to SIMD-ify the output code for all code paths. It's not part of the benchmark per se, so it would be best to make it as fast as possible.
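The proposed `--threads` switch could be sketched as below. To stay dependency-free, this sketch parses the flag by hand rather than with structopt, and the function name is illustrative; with rayon, the parsed value would feed something like `rayon::ThreadPoolBuilder::new().num_threads(n).build_global()`:

```rust
// Hypothetical sketch of the proposed `--threads` switch. In the real
// binary, the parsed value would configure rayon's global thread pool:
//   rayon::ThreadPoolBuilder::new().num_threads(n).build_global()
fn parse_threads(args: &[&str]) -> Option<usize> {
    let pos = args.iter().position(|a| *a == "--threads")?;
    args.get(pos + 1)?.parse().ok()
}

fn main() {
    let args = ["800", "600", "--threads", "1", "--algo", "vector"];
    assert_eq!(parse_threads(&args), Some(1));
    // Absent flag: fall back to rayon's default (one thread per core).
    assert_eq!(parse_threads(&["800", "600"]), None);
}
```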

@GabrielMajeri (Contributor, Author)

Also, a weird thing: widening all vector types from _x4 to _x8, and removing mul_adde, gave me a 10% performance boost when compiling for SSE2.

@gnzlbg (Contributor) commented Sep 9, 2018

Instead of adding a separate single-threaded implementation, would it be OK to add a command line switch to limit rayon's number of threads with a command line argument? Something like:

The examples serve three purposes, in decreasing order of importance:

  • teaching: they show new users how to write SPMD-like code with this library and rayon
  • measuring codegen quality: by comparing the same algorithm implemented in both ISPC and Rust
  • beating the benchmarks game

From the teaching point of view, I think it is useful for new users to see the scalar+serial code first, then see how it changes along both axes (scalar+parallel, vector+serial), and then see how to combine both. If we directly provide only the vector+parallel implementation, I am afraid that new users won't be able to make the connections required to go from the scalar+serial case to the vector+parallel case on their own problems.

For measuring codegen quality against ISPC we need to compare the same algorithms, and need to be particularly careful about not doing optimizations by hand that compilers are not allowed to do automatically (e.g. because they could affect the results).

Ideally, we wouldn't need to sacrifice either of these two objectives to beat the benchmarks game. However, that is often just not realistic. For this third objective, I would prefer that, alongside the scalar+serial, scalar+parallel, vector+serial, and vector+parallel implementations, we add a 5th implementation, e.g. vector+parallel+benchmarks_game, that does that. It doesn't have to reuse anything the other ones do (e.g. it doesn't have to use the same code for the b&w picture, it doesn't have to be able to print color pictures, etc.), and we could make it the default so that this crate would be fast when passed the same arguments the benchmarks game passes.

What do you think?

@GabrielMajeri (Contributor, Author)

new users won't be able to make the connections required to go from the scalar+serial case to the vector+parallel case on their own problems

I agree that the purpose of this example is to educate, which is why the code is heavily commented, but the difference between serial and parallel is literally a single par_chunks versus a chunks.
We can tell people to imagine that we wrote chunks there, and that would be it. The big difference exists between the scalar and SIMD code.

This is an embarrassingly parallel problem; the parallel code is so simple it's practically not there. I agree that for some other benchmarks the parallel code differs a lot from the serial code, but in this case it's not different enough to warrant a separate implementation.
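The one-word difference described above can be sketched roughly like this; the image buffer is processed one row-chunk at a time, and the parallel version only swaps `chunks_mut` for rayon's `par_chunks_mut` (plus the `rayon::prelude::*` import). Names here are illustrative, not the PR's:

```rust
// Serial version of the row-chunked driver loop. The parallel version
// is identical except for the one call marked below.
fn fill_rows(buf: &mut [u32], width: usize) {
    buf.chunks_mut(width) // parallel version: buf.par_chunks_mut(width)
        .enumerate()
        .for_each(|(row, chunk)| {
            for (col, px) in chunk.iter_mut().enumerate() {
                // Stand-in for the per-pixel Mandelbrot computation.
                *px = (row * width + col) as u32;
            }
        });
}

fn main() {
    let mut buf = vec![0u32; 6];
    fill_rows(&mut buf, 3);
    assert_eq!(buf, vec![0, 1, 2, 3, 4, 5]);
}
```

Because each row-chunk is written independently, the closure body needs no changes at all when switching to the rayon iterator.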

For measuring codegen quality against ISPC we need to compare the same algorithms,

I disagree about comparing codegen. Our objective should be to write a fast implementation while sticking to idiomatic Rust code (and there's room for improvement here). Nobody said ISPC's implementation is state-of-the-art. Besides, some of these optimizations provide more educational examples than ISPC's code.

Ideally, we wouldn't need to sacrifice any of these two objectives at all to beat the benchmarks game.

I think we might be able to win without sacrificing anything. I did some tests with the existing parallel SIMD code, and (while my changes broke the ability to output color images) I got within 12% of the fastest C++ code. I think it might be possible to work with the compiler and get even closer to the target performance.

@gnzlbg (Contributor) commented Sep 9, 2018

I agree that the purpose of this example is to educate, which is why the code is heavily commented, but the difference between serial and parallel is literally a single par_chunks versus a chunks.
We can tell people to imagine that we wrote chunks there, and that would be it. The big difference exists between the scalar and SIMD code.

This makes sense. I'm fine with it as long as we document this for the scalar+parallel code.

Nobody said ISPC's implementation is state-of-the-art.

Neither did I. ISPC's SPMD code is much nicer to read and write than packed_simd+rayon, and until recently it was still faster on most architectures. Comparing against ISPC has been a valuable source of improvements to packed_simd's internals and has significantly boosted its performance. Comparing packed_simd against anything better than ISPC in code clarity and/or performance would be a great addition. One option would be to compare against the C++ code used in the benchmarks game, adding it and linking against it the same way we currently do with ISPC.

@GabrielMajeri (Contributor, Author) commented Sep 9, 2018

Alright, I've rewritten the SIMD implementation to be similar to the scalar one and to use an iterator. I haven't noticed any performance loss. You can compare the scalar and SIMD code now; they're basically the same line for line, the only difference being that all scalar data types are SIMD-ified. I don't think it gets any easier for learning at this point.

[screenshot comparing the scalar and SIMD implementations]

Unfortunately, the scalar code has a diverging function while the SIMD code has an undiverged one, because, as I said before, replacing less-equal with greater-than breaks the tests for some unknown reason. I'm no SIMD expert, so it would be great if you could investigate that.
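For readers following along, the "undiverged" shape mentioned above can be emulated in scalar Rust: all lanes run in lock-step, a per-lane activity test (the less-equal form) decides which lanes still advance, and the loop exits only once every lane has escaped. The lane count and names below are illustrative, not the PR's actual packed_simd code:

```rust
// Scalar emulation of an undiverged SIMD escape-time loop: lanes run in
// lock-step and a per-lane `<=` test keeps active lanes advancing.
const LANES: usize = 4;

fn escape_counts(cx: [f64; LANES], cy: [f64; LANES], limit: u32) -> [u32; LANES] {
    let mut x = [0.0; LANES];
    let mut y = [0.0; LANES];
    let mut count = [0u32; LANES];
    for _ in 0..limit {
        let mut any_active = false;
        for l in 0..LANES {
            // Less-equal test: the lane stays active while |z|^2 <= 4,
            // mirroring the mask the SIMD version computes per lane.
            if x[l] * x[l] + y[l] * y[l] <= 4.0 {
                any_active = true;
                count[l] += 1;
                let nx = x[l] * x[l] - y[l] * y[l] + cx[l];
                y[l] = 2.0 * x[l] * y[l] + cy[l];
                x[l] = nx;
            }
        }
        if !any_active {
            break; // every lane has diverged; nothing left to do
        }
    }
    count
}

fn main() {
    let counts = escape_counts([0.0, 2.0, 0.0, 2.0], [0.0, 2.0, 0.0, 2.0], 50);
    assert_eq!(counts[0], 50); // c = 0 never escapes
    assert!(counts[1] < 3);    // c = 2 + 2i escapes quickly
}
```

In the real SIMD code the inner per-lane `if` becomes a vector comparison producing a mask, with the increment and update applied under that mask.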

With this change, we keep color-output support, and we're within 20% of the fastest implementation's performance. I plan to optimize the output functions after I get this PR merged, then see if we can apply some more tricks from the C++ implementation.

@gnzlbg (Contributor) commented Sep 10, 2018

This looks really good! Please ping me when CI is green, and thank you for doing this and for putting up with my often wrong ideas (if you are on IRC, sometimes it's easier to quickly get on the same page there than through here).

@GabrielMajeri GabrielMajeri force-pushed the refactor-mandelbrot branch 2 times, most recently from 613cdd3 to 37a917f Compare September 10, 2018 15:26
@gnzlbg (Contributor) commented Sep 10, 2018

I think it would be enough to just disable the test on Windows, and file an issue so we don't forget about this.

They fail for some unknown reason.
@gnzlbg gnzlbg merged commit 1629005 into rust-lang:master Sep 11, 2018
@gnzlbg (Contributor) commented Sep 11, 2018

The sparc64 build bots are reaching the 50-minute time limit; we should disable one or two examples there to put them back on track, otherwise CI becomes unreliable when Travis is under high load.

@GabrielMajeri GabrielMajeri deleted the refactor-mandelbrot branch September 11, 2018 03:30