replace OpenMP with tasklets #318
Conversation
note that this PR removes more lines of code than it adds/modifies. |
jenkins build this opm-simulators=1472 please |
any comments? |
I have only taken a cursory look, but this is not a trivial change. Smart, possibly. Good, possibly. Not adding any subtle bugs... no idea! For example, you are now manipulating mutexes inside some of the lambdas. Seems to me that it could cause performance loss, but you do not see it. Of course that is good in itself, but I find it hard to reason about what to expect and not! It seems to me that the code complexity at the parallelized code sites actually is larger than it was. (Most of the removed lines are due to the threadmanager.hh and locks.hh files being removed.) That is not good, I think. The most serious drawback in my opinion is replacing a known standard (OpenMP) with a homegrown library (mostly standard C++, but still) that cannot be considered mature at this point. I therefore come down against merging this, despite the benefit that it can build without OpenMP (I cannot really see why that is a problem). I know that these OpenMP pragmas probably were rarely used, but that in itself is no reason to replace it. (Rather, to remove it...) |
before I answer your concerns, let it be clear that this change cannot have any negative effects if you did not build with OpenMP, i.e., I bet almost always; certainly the distribution packages have never been built with OpenMP enabled. Also, for my taste, your arguments are too much in the subjunctive while doing nothing about the concrete problem of getting this functionality into the hands of more people.
yes; that's no change compared to manipulating mutexes in
yes, performance seems to be slightly better. but why don't you check this yourself and report on the results? it is quite simple: you only need to build with OpenMP enabled, then all ewoms tests can do multi-threaded linearization on both master and this branch by passing
if you mean the complexity of "change the
well, OpenMP is a standard, but not a universally available one: e.g. try to use it with older clang versions. even for the compilers that do implement it, it is hidden behind a flag. I think it is a good idea to make this functionality available-but-disabled by default, so that the code can be matured more easily. I am also pretty sure that the main "immaturity" which you're talking about is not the multithreading mechanism but the code itself, and currently it is next to impossible for non-developers to exercise it. even most developers don't seem to bother, or when was the last time that you've used the OpenMP codepath? want to bet a crate of the beverage of your choice against this? further, OpenMP is not flexible enough and not widely available enough to do stuff like asynchronous I/O, so we need (something like) a mature tasklets mechanism anyway. this PR helps in gaining that confidence.
true, but bringing mere mortals into a position to test this functionality is a very good reason for this change IMO. Removing the pragmas outright would be a regression, i.e., we would lose functionality that used to work. I've been told that this is a bad thing. also, if the user does not explicitly set the finally, we're quite early into the 2018.10 development cycle, i.e., it is time to merge somewhat experimental features to the development branch and root out bugs before release -- or, if this unexpectedly turns out to be impossible, revert them. |
Can this be combined in a smooth way with OpenMP in other parts of the code, e.g. the linear solvers, parser, etc.? |
yes. it will not have any negative effect on code that uses openMP. (The only change which happens if openMP is used in conjunction with this is that a few additional threads will be created.) |
I have tested this on norne and it seems to work fine. I like the feature that I can use this without recompiling and the ability to set the number threads from the command line. I don't know too much the technical details here, but from a users point of view the feature is welcome. |
Was hoping to skip working on this issue but as it takes up momentum I will do some testing with this and OpenMP. Currently my tendency is to not introduce homegrown substitutes for already existing features. |
I did some runs. Here are the results on my 32-threads system (omp vs. tasklets, vs MPI)
I also tried to combine MPI/OMP to get rid of the increasing number of linear steps of the linear solver. That works of course, but the total runtime is still bad if we saturate the processor. The assembly time even goes up (maybe the problems are so small that OMP has too much overhead):
I guess the runtime differences between tasklets and omp should be called noise in this case. (Maybe except for the run with 16 threads, where OMP seems to have less overhead). It seems like there is no gain in switching here. In contrast, we would need to invest more maintenance work. In addition, there are already a variety of parallelization libraries in use, and mixing/matching might become a nightmare. EXA-Dune settled on threading building blocks (TBB), and I recall there was a PR in core dune to support threads within ISTL's parallelization approach. (Unfortunately, I am having a hard time finding this.) IMHO my numbers also show that these easy fork/join thread parallelization approaches currently in OPM are not very powerful. It should be possible to see the same speedups for the parallelization of the assembly as for MPI, but that would need a proper implementation to take advantage of data locality. |
@totto82 one could add a command line parameter for OMP, too. Actually, ewoms already supports this. It is just not yet supported by OPM. The patch would be nearly the same as the one for the tasklets in opm-simulators. |
@blattms still, it would need to be compiled with non-standard compiler flags to enable the functionality. any idea how to get around this? |
thanks for testing. did you test with K MPI processes and J threads per MPI process, where K is the number of memory channels of your machine and K*J equals the number of cores of your processor? i.e., K is most likely 4 and J would be 4 or 8.
not really: tasklets are required for asynchronous output anyway, i.e., as long as @alfbr insists on having async I/O we need to maintain the tasklets infrastructure or something similar.
MPI is probably faster because it makes it possible for the kernel to allocate memory so that the threads of the simulator do not impede each other as much while accessing memory. that said, with the current design of the parser, a lot of additional memory is required for a pure MPI solution (IIRC a lot means 6GB per core for Model2?), and as soon as enough processes exist to keep all memory channels busy, the speedup of MPI compared to threads should vanish. |
The async IO is currently very beneficial on my end. However, maintaining the asyncIO threading implementation is hardly what was meant. Just looking at the tasklets threading for the assembly part in isolation, it seems to make the threading far more intrusive than the old openmp implementation, and hence makes the code harder to maintain. Do you have any argument as to why you put in all this new threading code, i.e., are there any benefits? Do you plan on maintaining it yourself? |
yes, but it is what it is!
could you please be a tad more specific?
plenty:
yes, just like the vast majority of the code of opm-material and eWoms. |
This must be a rhetorical question. CMAKE_CXX_FLAGS_RELEASE are far from standard. Adding a flag depending on a specific compiler should be possible, like for other things previously, too. |
Like I said that can easily be changed in the build system if it seems beneficial.
This is a rather new addition (OPM/opm-simulators#1434, #299), which I recall caused some problems initially (see OPM/opm-simulators#1437). Tasklets were not used a quarter of a year ago. I wonder why the async output code was rewritten... At least for me, the pull request description does not reveal this. |
Sure. With the OpenMP implementation, you see this:

```cpp
#ifdef _OPENMP
#pragma omp parallel
#endif
```

it is well known, well documented and well understood what it does. Tasklets, on the other hand, require understanding the implementation done by you, with all the possible pitfalls associated with multi-threading.
What do you mean? The tasklet solution is new, introduced by you a couple of months ago. I am still completely in the dark as to why you did it? Have you checked that the tasklet implementation maintains the performance benefits of the original async IO implementation?
I appreciate and respect the opinion of others on this. If multiple developers feel the same way, I have no problem with it.
adding -fopenmp to the default compile switches achieves the same, so your statement puzzles me. I would even argue that it is a feature. With this pull request we can no longer turn off compilation of the multi-threading for linearization (as doing so would remove the asyncIO with it). Was this your only motivation for rewriting all the threading code that was already working well, with (as far as I know) little need for maintenance? Honestly open for compelling reasons. |
with tasklets, the parallel scope is changed to a lambda and this is used instead:

```cpp
auto& taskletRunner = simulator_.taskletRunner();
int numThreads = std::max(taskletRunner.numWorkerThreads(), 1);
taskletRunner.dispatchFunction(updateWellDataLambda, /*numInvokations=*/numThreads);
taskletRunner.barrier();
```

you can't seriously call this "additional complexity"!
changing the default flags in the OPM build system as @blattms suggests is bound to cause trouble with Dune modules that are built without this.
this statement is just plain wrong: it is disabled by default because only a single thread is used, and removing tasklets would kill async I/O even without this being merged! Also, it is harder to disable with the openMP based solution! Anyway, could somebody please bring up some technical reasons for not merging it? so far it did not cause any regressions, and if "the current solution adheres to standards more" keeps being repeated ad infinitum, this guy should be taken seriously IMO: https://lwn.net/Articles/756732/ So yeah, I'm tired of this discussion and I'm also tired of openMP. Since I'm the only guy who did any non-trivial changes to this code over the (at least) last 4 years, I offer the following two solutions:
in both cases it is relatively easy to fork the affected classes; you decide. |
#326 implements option (1). |
(I commented in the other PR before I saw your latest comment @andlaus, but I will try to keep the discussion here). I think those two options are not the only ones. We also have:
@andlaus, I think that you are trying to force the issue here, which is good for progress (unless the end is alternative 5, we should avoid that...)! But I am not sure the rest of us (or me at least) are ready to commit to an option yet. And I think it is not fair to say that either we eliminate all traces of thread-parallelism or we have to do it with the tasklets. |
for options (3) to (5) I'm not available within eWoms, but you can do this for the then-new opm-simulator forks if you desire to go down this path.
I did not say that: I offered to fork the relevant code to opm-simulator if somebody else is willing to take up the burden of maintaining these forks, i.e. "backseat maintainership" (deciding technical stuff without being willing to bear the consequences) is not acceptable for me. |
jenkins build this please |
jenkins build this opm-simulators=1472 please |
jenkins build this opm-simulators=1472 please |
jenkins build this opm-simulators=1472 please |
jenkins build this please |
jenkins build this please |
I am not sure implementing one of the OpenMP environment variables makes sense here. The implementation in this pull request is not OpenMP based, and I assume it does not implement any of the other OpenMP environment variables, ref. https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html Of course this also exemplifies why rolling your own threading implementation may not be the best idea. |
if you promise to not complain about this when the time comes to merge this PR, I'm more than happy to get rid of the OMP_NUM_THREADS hack again... |
My preference is to not spend any time on this PR now. If this PR is indeed merged, I think honouring OpenMP environment variables makes little sense, so I would not complain about their removal. No strong objections if it is kept either. |
then let's get over it and merge it immediately! |
jenkins build this please |
jenkins build this please |
this makes multi-threaded linearization available without having to build with OpenMP. as an added benefit, performance seems to be slightly better.
if OpenMP is enabled and the number of threads is auto-determined, openMP determines them. although we do not really use OpenMP anymore, this makes the transition a bit simpler.
this makes opm-simulators master happy without any patches.
this makes multi-threaded linearization available without having to build with OpenMP; a C++11-compliant compiler is enough. as an added benefit, performance seems to be slightly better.
this PR requires some pretty trivial mop-up in opm-simulators.