
replace OpenMP with tasklets #318

Closed · wants to merge 3 commits

Conversation

@andlaus (Contributor) commented May 2, 2018

This makes multi-threaded linearization available without having to build with OpenMP; a C++11-compliant compiler is enough. As an added benefit, performance seems to be slightly better.

This PR requires some pretty trivial mop-up in opm-simulators.
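
For readers who have not looked at the code: the tasklets idea needs nothing beyond the C++11 standard library. What follows is a minimal, hypothetical sketch of such a runner. It mirrors the numWorkerThreads()/dispatchFunction()/barrier() interface quoted later in this thread, but it is not the PR's actual TaskletRunner implementation:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // SimpleTaskletRunner: a hypothetical, minimal thread pool built only
    // on C++11 primitives. NOT the actual TaskletRunner from this PR.
    class SimpleTaskletRunner
    {
    public:
        explicit SimpleTaskletRunner(int numThreads)
        {
            for (int i = 0; i < numThreads; ++i)
                workers_.emplace_back([this]() { workerLoop_(); });
        }

        ~SimpleTaskletRunner()
        {
            {
                std::lock_guard<std::mutex> guard(mutex_);
                shutdown_ = true;
            }
            workAvailable_.notify_all();
            for (auto& worker : workers_)
                worker.join();
        }

        int numWorkerThreads() const
        { return static_cast<int>(workers_.size()); }

        // enqueue `fn` numInvocations times; each invocation may run on a
        // different worker thread
        void dispatchFunction(std::function<void()> fn, int numInvocations = 1)
        {
            std::lock_guard<std::mutex> guard(mutex_);
            for (int i = 0; i < numInvocations; ++i) {
                pendingWork_.push(fn);
                ++unfinishedWork_;
            }
            workAvailable_.notify_all();
        }

        // block until every dispatched invocation has finished
        void barrier()
        {
            std::unique_lock<std::mutex> lock(mutex_);
            allDone_.wait(lock, [this]() { return unfinishedWork_ == 0; });
        }

    private:
        void workerLoop_()
        {
            for (;;) {
                std::function<void()> work;
                {
                    std::unique_lock<std::mutex> lock(mutex_);
                    workAvailable_.wait(lock, [this]() {
                        return shutdown_ || !pendingWork_.empty();
                    });
                    if (shutdown_ && pendingWork_.empty())
                        return;
                    work = pendingWork_.front();
                    pendingWork_.pop();
                }

                work(); // run the tasklet outside of the lock

                std::lock_guard<std::mutex> guard(mutex_);
                if (--unfinishedWork_ == 0)
                    allDone_.notify_all();
            }
        }

        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> pendingWork_;
        std::mutex mutex_;
        std::condition_variable workAvailable_;
        std::condition_variable allDone_;
        int unfinishedWork_ = 0;
        bool shutdown_ = false;
    };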

@andlaus (Contributor, Author) commented May 2, 2018

Note that this PR removes more lines of code than it adds or modifies.

@andlaus (Contributor, Author) commented May 2, 2018

jenkins build this opm-simulators=1472 please

@andlaus (Contributor, Author) commented May 7, 2018

any comments?

@atgeirr (Member) commented May 8, 2018

> any comments?

I have only taken a cursory look, but this is not a trivial change. Smart, possibly. Good, possibly. Not adding any subtle bugs... no idea!

For example, you are now manipulating mutexes inside some of the lambdas. It seems to me that this could cause a performance loss, yet you do not see one. Of course that is good in itself, but I find it hard to reason about what to expect and what not! It also seems to me that the code complexity at the parallelized code sites is actually larger than it was. (Most of the removed lines are due to the threadmanager.hh and locks.hh files being removed.) That is not good, I think.

The most serious drawback, in my opinion, is that this replaces a known standard (OpenMP) with a homegrown library (mostly standard C++, but still) that cannot be considered mature at this point. I therefore come down against merging this, despite the benefit that it can build without OpenMP (I cannot really see why that is a problem). I know that these OpenMP pragmas were probably rarely used, but that in itself is no reason to replace them. (Rather, to remove them...)

@andlaus (Contributor, Author) commented May 9, 2018

Before I answer your concerns, let it be clear that this change cannot have any negative effects if you did not build with OpenMP, which I bet is almost always the case; certainly the distribution packages have never been built with OpenMP enabled. Also, for my taste, your arguments are too much in the subjunctive, while doing nothing about the concrete problem of getting this functionality into the hands of more people.

> For example, you are now manipulating mutexes inside some of the lambdas

Yes; that's no change compared to manipulating mutexes in #pragma omp parallel sections. If I understand correctly how OpenMP is implemented, it does essentially the same thing.
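
As a hypothetical illustration (not code from this PR): a mutex around a shared result is written identically in both worlds; only the mechanism that invokes the parallel callers differs.

    #include <mutex>

    std::mutex resultMutex;
    double globalResidual = 0.0;

    // called concurrently, either from within a "#pragma omp parallel"
    // region or from the lambda handed to the tasklet runner; the locking
    // itself is exactly the same in both cases
    void addLocalResidual(double localResidual)
    {
        std::lock_guard<std::mutex> guard(resultMutex);
        globalResidual += localResidual;
    }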

> It seems to me that this could cause a performance loss, yet you do not see one. Of course that is good in itself, but I find it hard to reason about what to expect and what not!

Yes, performance seems to be slightly better. But why don't you check this yourself and report on the results? It is quite simple: you only need to build with OpenMP enabled; then all eWoms tests can do multi-threaded linearization on both master and this branch by passing --threads-per-process=$N. (flow can't do this at the moment.)

> It also seems to me that the code complexity at the parallelized code sites is actually larger than it was.

If you mean the complexity of "change the #pragma omp parallel section to a lambda, get the tasklet runner, determine the number of threads per process and dispatch the lambda", you're right. But that is a trivial amount of complexity. (If you disagree on that, I'll remind you of this on the next PR that you want to get merged ;))

> The most serious drawback, in my opinion, is that this replaces a known standard (OpenMP) with a homegrown library (mostly standard C++, but still) that cannot be considered mature at this point.

Well, OpenMP is a standard, but not a universally available one: e.g., try to use it with older clang versions. Even for the compilers that do implement it, it is hidden behind a flag. I think it is a good idea to make this functionality available-but-disabled by default, so that the code can mature more easily.

I am also pretty sure that the main "immaturity" which you're talking about is not the multithreading mechanism but the code itself, and currently it is next to impossible for non-developers to exercise it. Even most developers don't seem to bother; when was the last time that you used the OpenMP codepath? Want to bet a crate of the beverage of your choice on this?

Further, OpenMP is neither flexible enough nor widely available enough to do things like asynchronous I/O, so we need (something like) a mature tasklets mechanism anyway. This PR helps in gaining that confidence.
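
To illustrate the async-I/O point, here is a self-contained, hypothetical sketch of the pattern; a plain std::thread stands in for the tasklet runner, and the buffer, file name, and writer function are made up:

    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>

    // hand a snapshot of the results to a background thread so that the
    // simulation can continue computing the next time step while the
    // previous one is written to disk
    void writeReportStepAsync(std::thread& ioThread,
                              std::vector<double> resultBuffer, // deliberately a copy
                              const std::string& fileName)
    {
        // "barrier": the previous report step must be fully written first
        if (ioThread.joinable())
            ioThread.join();

        ioThread = std::thread([resultBuffer, fileName]() {
            std::ofstream out(fileName, std::ios::binary);
            out.write(reinterpret_cast<const char*>(resultBuffer.data()),
                      static_cast<std::streamsize>(resultBuffer.size()*sizeof(double)));
        });
    }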

> I know that these OpenMP pragmas were probably rarely used, but that in itself is no reason to replace them. (Rather, to remove them...)

True, but bringing mere mortals into a position to test this functionality is a very good reason for this change, IMO. Removing the pragmas outright would be a regression, i.e., we would lose functionality that used to work. I've been told that this is a bad thing.

Also, if the user does not explicitly set the ThreadsPerProcess parameter, synchronous mode is used and all of this discussion is moot.

Finally, we're quite early into the 2018.10 development cycle, i.e., it is time to merge somewhat experimental features to the development branch and root out bugs before the release -- or, if this unexpectedly turns out to be impossible, revert them.

@totto82 (Member) commented Jun 4, 2018

Can this be combined in a smooth way with OpenMP in other parts of the code, i.e., the linear solvers, parser, etc.?

@andlaus (Contributor, Author) commented Jun 4, 2018

> Can this be combined in a smooth way with OpenMP in other parts of the code

Yes. It will not have any negative effect on code that uses OpenMP. (The only change if OpenMP is used in conjunction with this is that a few additional threads will be created.)

@totto82 (Member) commented Jun 5, 2018

I have tested this on Norne and it seems to work fine. I like that I can use this without recompiling, and the ability to set the number of threads from the command line. I don't know the technical details here too well, but from a user's point of view the feature is welcome.

@blattms (Member) commented Jun 8, 2018

I was hoping to skip working on this issue, but as it gains momentum I will do some testing with this and OpenMP. Currently my tendency is to not introduce homegrown substitutes for already existing features.

@blattms (Member) commented Jun 13, 2018

I did some runs. Here are the results on my 32-thread system (OMP vs. tasklets vs. MPI):

Threads | Assembly OMP | Total OMP | Assembly tasklets | Total tasklets | Assembly MPI | Total MPI
--------|--------------|-----------|-------------------|----------------|--------------|----------
      1 |          276 |       516 |               261 |            502 |          276 |       516
      2 |          223 |       468 |               214 |            461 |          168 |       339
      4 |          161 |       414 |               162 |            416 |           86 |       217
      8 |          103 |       353 |               107 |            348 |           53 |       167
     16 |           77 |       330 |                91 |            345 |           36 |       171
     32 |           90 |       346 |                94 |            352 |           36 |       242

I also tried to combine MPI/OMP to get rid of the increasing number of steps of the linear solver. That works, of course, but the total runtime is still bad if we saturate the processor. The assembly time even goes up (maybe the problems are so small that OMP has too much overhead):

MPI procs | OMP threads/proc | Assembly | Total
----------|------------------|----------|------
        8 |                2 |       58 |   224
        8 |                4 |       44 |   183
       16 |                2 |       48 |   216

I guess the runtime differences between tasklets and OMP should be called noise in this case (maybe except for the run with 16 threads, where OMP seems to have less overhead). It seems like there is no gain in switching here. In contrast, we would need to invest more maintenance work. In addition, there is already a variety of parallelization libraries in use, and mixing and matching might become a nightmare. EXA-Dune settled on Threading Building Blocks (TBB), and I recall there was a PR in core DUNE to support threads within ISTL's parallelization approach. (Unfortunately, I am having a hard time finding it.)

IMHO my numbers also show that these easy fork/join thread-parallelization approaches currently in OPM are not very powerful. It should be possible to see the same speedups for the parallelization of the assembly as for MPI, but that would need a proper implementation that takes advantage of data locality.

@blattms (Member) commented Jun 13, 2018

@totto82 one could add a command-line parameter for OMP, too. Actually, eWoms already supports this; it is just not yet supported by OPM. The patch would be nearly the same as the one for the tasklets in opm-simulators.

@andlaus (Contributor, Author) commented Jun 13, 2018

@blattms still, it would need to be compiled with non-standard compiler flags to enable the functionality. Any idea how to get around this?

@andlaus (Contributor, Author) commented Jun 13, 2018

Thanks for testing. Did you test with K MPI processes and J threads per MPI process, where K is the number of memory channels of your machine and K*J equals the number of cores of your processor? I.e., K is most likely 4, and J would be 4 or 8.

@andlaus (Contributor, Author) commented Jun 13, 2018

> In contrast, we would need to invest more maintenance work.

Not really: tasklets are required for asynchronous output anyway, i.e., as long as @alfbr insists on having async I/O, we need to maintain the tasklets infrastructure or something similar.

> IMHO my numbers also show that these easy fork/join thread-parallelization approaches currently in OPM are not very powerful. It should be possible to see the same speedups for the parallelization of the assembly as for MPI, but that would need a proper implementation that takes advantage of data locality.

MPI is probably faster because it allows the kernel to allocate memory such that the simulator's threads do not impede each other as much when accessing memory. That said, with the current design of the parser, a lot of additional memory is required for a pure MPI solution (IIRC "a lot" means 6GB per core for Model2?), and as soon as enough processes exist to keep all memory channels busy, the speedup of MPI compared to threads should vanish.

@alfbr (Member) commented Jun 13, 2018

> as long as @alfbr insists on having async I/O, we need to maintain the tasklets infrastructure or something similar.

The async I/O is currently very beneficial on my end. However, maintaining the async I/O threading implementation is hardly what was meant. Just looking at the tasklets threading for the assembly part in isolation, it seems to make the threading far more intrusive than the old OpenMP implementation, and hence makes the code harder to maintain. Do you have any argument as to why you put in all this new threading code, i.e., are there any benefits? Do you plan on maintaining it yourself?

@andlaus (Contributor, Author) commented Jun 13, 2018

> However, maintaining the async I/O threading implementation is hardly what was meant.

Yes, but it is what it is!

> Just looking at the tasklets threading for the assembly part in isolation, it seems to make the threading far more intrusive than the old OpenMP implementation,

Could you please be a tad more specific?

> Do you have any argument as to why you put in all this new threading code, i.e., are there any benefits?

Plenty:

  • As said exhaustively, the new solution is based neither on a new nor on a currently unused mechanism.
  • The additional complexity compared to OpenMP is trivial.
  • As confirmed by @blattms' numbers above, performance seems to be largely the same; sometimes it's slightly better, sometimes slightly worse than OpenMP.
  • The tasklets-based solution makes multi-threaded linearization accessible to mere mortals who do not compile the stuff themselves with obscure compiler flags. I might be wrong, but I'd like everyone who reads this and who has regularly compiled and used flow with OpenMP support before this PR to state so in a comment.
  • All people who complained about having to maintain the tasklets code (a) added much bigger maintenance burdens in the past, (b) never touched the old code, and (c) will not have to maintain the new stuff. (That said, I have to maintain this either way, and I consider the non-OpenMP solution quite advantageous.)
  • Since tasklets are required for async I/O anyway, using them in as many contexts as possible increases confidence in them (which is already quite high on my end).

> Do you plan on maintaining it yourself?

Yes, just like the vast majority of the code of opm-material and eWoms.

@blattms (Member) commented Jun 13, 2018

> still, it would need to be compiled with non-standard compiler flags to enable the functionality. Any idea how to get around this?

This must be a rhetorical question. CMAKE_CXX_FLAGS_RELEASE are far from standard. Adding a flag depending on a specific compiler should be possible, as it was for other things previously.

@blattms (Member) commented Jun 13, 2018

> The tasklets-based solution makes multi-threaded linearization accessible to mere mortals who do not compile the stuff themselves with obscure compiler flags. I might be wrong, but I'd like everyone who reads this and who has regularly compiled and used flow with OpenMP support before this PR to state so in a comment.

Like I said, that can easily be changed in the build system if it seems beneficial.

> Since tasklets are required for async I/O anyway, using them in as many contexts as possible increases confidence in them (which is already quite high on my end).

This is a rather new addition (OPM/opm-simulators#1434, #299), which I recall caused some problems initially (see OPM/opm-simulators#1437). Tasklets were not used a quarter of a year ago. I wonder why the async output code was rewritten... At least for me, the pull request description does not reveal this.

@alfbr (Member) commented Jun 13, 2018

> Could you please be a tad more specific?

Sure. With the OpenMP implementation, you see this:

    #ifdef _OPENMP
    #pragma omp parallel
    #endif

It is well known, well documented, and well understood what it does. Tasklets, on the other hand, require understanding the implementation done by you, with all the possible pitfalls associated with multi-threading.

> the new solution is based neither on a new nor on a currently unused mechanism

What do you mean? The tasklet solution is new, introduced by you a couple of months ago. I am still completely in the dark as to why you did it. Have you checked that the tasklet implementation maintains the performance benefits of the original async I/O implementation?

> The additional complexity compared to OpenMP is trivial.

I appreciate and respect the opinion of others on this. If multiple developers feel the same way, I have no problem with it.

> the tasklets-based solution makes multi-threaded linearization accessible to mere mortals

Adding -fopenmp to the default compile switches achieves the same, so your statement puzzles me. I would even argue that it is a feature: with this pull request, we can no longer turn off compilation of the multi-threading for linearization (as that would remove the async I/O with it).

Was this your only motivation for rewriting all the threading code that was already working well, with (as far as I know) little need for maintenance? I am honestly open to compelling reasons.

@andlaus (Contributor, Author) commented Jun 14, 2018

> Could you please be a tad more specific?
>
> Sure. With the OpenMP implementation, you see this:
>
>     #ifdef _OPENMP
>     #pragma omp parallel
>     #endif

With tasklets, the parallel scope is changed to a lambda, and this is used instead:

        auto& taskletRunner = simulator_.taskletRunner();
        int numThreads = std::max(taskletRunner.numWorkerThreads(), 1);
        taskletRunner.dispatchFunction(updateWellDataLambda, /*numInvokations=*/numThreads);
        taskletRunner.barrier();

You can't seriously call this "additional complexity"!
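
For completeness, a hypothetical sketch of how a single lambda dispatched numThreads times can split up the work without any worker-index API: each invocation atomically claims chunks of a made-up work range (this is not the PR's actual lambda):

    #include <algorithm>
    #include <atomic>

    void linearizeElement(int elemIdx); // hypothetical per-element work, defined elsewhere

    std::atomic<int> nextChunkStart(0);
    const int numElements = 10000; // hypothetical amount of work
    const int chunkSize = 64;

    // every one of the numThreads invocations runs this same lambda and
    // grabs chunks until the range is exhausted
    auto workLambda = []() {
        for (;;) {
            int begin = nextChunkStart.fetch_add(chunkSize);
            if (begin >= numElements)
                return;
            int end = std::min(begin + chunkSize, numElements);
            for (int elemIdx = begin; elemIdx < end; ++elemIdx)
                linearizeElement(elemIdx);
        }
    };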

> Adding -fopenmp to the default compile switches achieves the same, so your statement puzzles me.

Changing the default flags in the OPM build system, as @blattms suggests, is bound to cause trouble with DUNE modules that are built without them.

> With this pull request, we can no longer turn off compilation of the multi-threading for linearization (as that would remove the async I/O with it).

This statement is just plain wrong: multi-threading is disabled by default because only a single thread is used, and removing tasklets would kill async I/O even without this PR being merged! Also, it is harder to disable with the OpenMP-based solution!


Anyway, could somebody please bring up some technical reasons for not merging it? So far it has not caused any regressions, and to those repeating "the current solution adheres to standards more" ad infinitum: this guy should be taken seriously IMO: https://lwn.net/Articles/756732/

So yeah, I'm tired of this discussion and I'm also tired of OpenMP. Since I'm the only guy who has made any non-trivial changes to this code over (at least) the last 4 years, I offer the following two solutions:

  1. Remove OpenMP support from eWoms without a replacement. This would not constitute a regression for flow, because multi-threading is disabled by the current flow master and everybody seemed to have been happy with that.
  2. Merge this PR.

In both cases it is relatively easy to fork the affected classes to opm-simulators and go with whatever fancy multi-threading mechanism you prefer. If you want this, I'll do the forking work, but somebody else needs to explicitly take over maintainership responsibility for the forked version.

You decide.

@andlaus (Contributor, Author) commented Jun 14, 2018

#326 implements option (1).

@atgeirr (Member) commented Jun 14, 2018

(I commented in the other PR before I saw your latest comment, @andlaus, but I will try to keep the discussion here.)

I think those two options are not the only ones. We also have:

  3. Embrace OpenMP actively.
  4. Embrace TBB for this type of thing (see the sketch below).
  5. Do nothing.
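
For reference, a minimal sketch of what option 4 could look like for an assembly loop. tbb::parallel_for is TBB's real entry point; the function and element count here are hypothetical:

    #include <tbb/parallel_for.h>

    void linearizeElement(int elemIdx); // hypothetical per-element work, defined elsewhere

    void assembleAllElements(int numElements)
    {
        // TBB partitions [0, numElements) over its worker threads
        tbb::parallel_for(0, numElements, [](int elemIdx) {
            linearizeElement(elemIdx);
        });
    }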

@andlaus, I think that you are trying to force the issue here, which is good for progress (unless the end is alternative 5; we should avoid that...)! But I am not sure the rest of us (or me, at least) are ready to commit to an option yet. And I think it is not fair to say that either we eliminate all traces of thread-parallelism or we have to do it with the tasklets.

@andlaus (Contributor, Author) commented Jun 14, 2018

For options (3) to (5) I'm not available within eWoms, but you can do this for the then-new opm-simulators forks if you desire to go down this path.

> And I think it is not fair to say that either we eliminate all traces of thread-parallelism or we have to do it with the tasklets.

I did not say that: I offered to fork the relevant code to opm-simulators if somebody else is willing to take up the burden of maintaining these forks, i.e., "backseat maintainership" (deciding technical stuff without being willing to bear the consequences) is not acceptable for me.

@andlaus (Contributor, Author) commented Jun 21, 2018

jenkins build this please

@andlaus (Contributor, Author) commented Jun 21, 2018

jenkins build this opm-simulators=1472 please

@andlaus (Contributor, Author) commented Jun 25, 2018

jenkins build this opm-simulators=1472 please

@andlaus (Contributor, Author) commented Jul 12, 2018

jenkins build this opm-simulators=1472 please

@andlaus (Contributor, Author) commented Aug 20, 2018

jenkins build this please

@andlaus (Contributor, Author) commented Aug 20, 2018

jenkins build this please

@alfbr (Member) commented Aug 20, 2018

I am not sure implementing one of the OpenMP environment variables makes sense here. The implementation in this pull request is not OpenMP-based, and I assume it does not implement any of the other OpenMP environment variables; cf. https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html

Of course this also exemplifies why rolling your own threading implementation may not be the best idea.

@andlaus (Contributor, Author) commented Aug 20, 2018

If you promise not to complain about this when the time comes to merge this PR, I'm more than happy to get rid of the OMP_NUM_THREADS hack again...
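
For the record, the kind of OMP_NUM_THREADS fallback being discussed takes only a few lines of standard C++. A hypothetical sketch, not the PR's actual code:

    #include <cstdlib>
    #include <string>
    #include <thread>

    // honour OMP_NUM_THREADS if it is set to something sensible, otherwise
    // fall back to the hardware concurrency
    int defaultNumWorkerThreads()
    {
        if (const char* env = std::getenv("OMP_NUM_THREADS")) {
            try {
                int n = std::stoi(env);
                if (n > 0)
                    return n;
            }
            catch (const std::exception&) {
                // unparsable value: fall through to the default
            }
        }
        return static_cast<int>(std::thread::hardware_concurrency());
    }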

@alfbr (Member) commented Aug 21, 2018

My preference is to not spend any time on this PR now. If this PR is indeed merged, I think honouring OpenMP environment variables makes little sense, so I would not complain about their removal. No strong objections if it is kept either.

@andlaus (Contributor, Author) commented Aug 25, 2018

> My preference is to not spend any time on this PR now.

Then let's get it over with and merge it immediately!

@andlaus (Contributor, Author) commented Feb 5, 2019

jenkins build this please

@andlaus (Contributor, Author) commented Feb 5, 2019

jenkins build this please

The three commits:

  • This makes multi-threaded linearization available without having to build with OpenMP. As an added benefit, performance seems to be slightly better.
  • If OpenMP is enabled and the number of threads is auto-determined, OpenMP determines them. Although we do not really use OpenMP anymore, this makes the transition a bit simpler.
  • This makes opm-simulators master happy without any patches.
@blattms closed this Sep 11, 2020