-
-
Couldn't load subscription status.
- Fork 57
Fixed int128t issue. Hopefully. #250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Current coverage is 48.18% (diff: 100%)@@ master #250 diff @@
==========================================
Files 2 2
Lines 907 907
Methods 42 42
Messages 0 0
Branches 195 195
==========================================
Hits 437 437
Misses 392 392
Partials 78 78
|
5194fce to
2f4f505
Compare
|
@vehre thanks for taking a look!
|
|
LGTM I confirmed that the only test failure with latest GCC trunk (built today) is still the |
|
@zbeekman the latest virtual machine now has a trunk build from the same date cited above and I too confirm that there is only one test failure, but my recollection is that it's a different than the once cited above. In the virtual machine, the one test that fails is get_with_offset_1d.f90, which runs as test 9. I thought that was the same one you were seeing on OS X. No? @vehre, I believe my previous problems with building in the virtual machine related to some twisted logic in the environment modules. The problem has mostly been sorted out. Would it be helpful for you to download the latest virtual machine and run our tests with the default compiler in the sourcery student account? At least then you would be able to reproduce the exact scenario that I'm seeing. |
|
@rouson @zbeekman and I figured, that the testcase # 9 is missing synchronisation. Because everything is parallel in the testcase, the image doing the check, might start doing so before the data has been set by the remote images. So there may still be an issue with the sync all. I will download newest VM and recheck. |
|
@vehre After looking at the standard some more and discussing with others, I'm not convinced that the issue is with missing synchronization. Gets, e.g., Also, I don't think we need explicit synchronization at the end of the test: shouldn't the runtime ensure the buffers are flushed before terminating the program? I think I saw some comments in |
|
I'll do some more testing and reading up on the standard to confirm this, but I think what I said is true. |
|
I just reviewed the test. Although I could read the standard some to confirm, my experience with CAF indicates to me that the test is correct and the failure of the test represents a bug either in gfortran or in OpenCoarrays. To corroborate that, I'll mention that @afanfa has previously communicated to me that the standard requires "gets" are blocking, which really is the only thing that could make sense, given the lack of a mechanism in Fortran 2008 for checking whether the communication has completed. @zbeekman, you raise a very interesting and subtle point regarding whether the second synchronization is needed. I have generally assumed it is, but you caused me to think more about it and I'm not certain for the following reason: program termination is a three step process: (1) initiate termination, (2) synchronize, (3) terminate. I wonder if an error stop is able to cause error termination of all images during step (2). My suspicion is yes, in which case the second synchronization is not needed. I'll be seeing Bill Long of Cray, Inc., later today and hope to have a moment to probe him on this. Cray invented coarrays. Bill was the lead author on the TS that added the new parallel features in Fortran 2015 and I assume he was involved in designing the coarray features of 2008 so he's an authority. |
|
I'm going to refactor the test to make the code more compact and modern, e.g., using an array statement instead of a loop. My changes should not affect the outcome. At some point soon, I should go through all the tests and make similar changes. |
|
Now, as I have explained already, is testcase # 9 using an allocatable array on the lhs, namely the b. This mandates the use of the new API. The new API publishes the array descriptors as mpi-windows in this case the "a". Those array descriptors live on the stack of the main program. Don't mix up array descriptors and array data. The array data is allocatable and allocated using ALLOCATE(). The array descriptor is the meta-data giving the rank the extends in each dimension and the datatype of the array content. Now when an image leaves the main program, its stack is voided and caf_finalize() is called, which overwrites the memory published as mpi-window. This can cause a crash or inconsistent data. In principle is the calling sequence starting from invocating the program from a shell in every Fortran program like this: int main(int argc, char **argv) { void MAIN__() { a.data = 0B; // Set that the allocatable array is unallocated. I am still thinking of how to fix this. Approaches may be:
I don't like the latter, because it may add to many synchronisation points. What are your opinions?
|
At the end of which function? Adding synchronization when exiting the Fortran "main" seems like it might be an acceptable solution to me, but I defer to @afanfa's expertise. The only obvious case that concerns me is if there is a failed image, or unexpected termination of some sort on one of the images, that will then cause all the other images to hang in, while waiting for a I'm not sure I fully understand the implications of approach # 1: the array descriptors (and data, right?) would stick around until the barrier in |
Yes, your understanding is correct. In this special case yes to MAIN__(). But in fact would that approach need synchronisation in every function that has coarray stack variables, which makes the approach not feasible. I did some research what it would need to make the variable static: exactly one statement in the compiler and instantly the issue is gone. Btw, on my system the testcases changed their numbering: get_with_offset_1d is now # 10. With my newest copy of git:///vehre/coarray only testcase get_self/sameloc.f90 fails.
Right, they would go to an rss-segment in the executable and no longer on the stack. Because gfortran does not have this variable declared as SAVE, the variable will be reinitialized upon reentering a function. Shoot, that's the next issue. Imagine this is done in a loop. Oh no, what a nightmare. |
yes, sorry for the confusion surrounding the numbering, I re-enabled some more tests in 76c4492 I've got a patch for building GCC from trunk on OS X now, so I'll take another look at the test failures. I'm a little bit surprised the issue you mentioned with array descriptors on the stack getting corrupted by |
It couldn't. Only with introducing the new API (the caf*_by_ref()-functions) those descriptors need to be published. The old API did not need the descriptor present any time, because those needed to in-sync on all images, so publishing them was not necessary. And with allowing reallocation of the destination buffer by the API-function this information needs to be present (and changed). |
Those are needed only by the new API of caf*_by_ref() functions. Because this API allows reallocation of result arrays the descriptor needs to be fixed accordingly, which of course needs the descriptor to be available. Furthermore does the new API need the descriptor for reffing into the arrays. The old API implemented reffing only for arrays, that are same sized on all images. Therefore supplying the image's local descriptor to the communication-routine was sufficient. But with the support for allocatable components, each component can have a different size on the different images, which enforces the need for the array descriptor to be accessible. While designing the new API we tried to be as universal as possible and therefore the array descriptor is needed whenever a coarray is registered, even when the descriptor is never used. |
|
I have analyzed the issue in sameloc further. With introducing the new API sendget(_by_ref) is used, whenever a coarray is present on the left- and right-hand side of an assignment. Debugging into _caf_sendget() I do not see the stride respected properly. I therefore deem the bug in _caf_sendget(). I tried to fix this by branching to _caf_get() when the image_index_s is this_image. This helped for cases where only get() was involved, unfortunately is also a send involved. |
|
FYI: caf_send does not support the "advanced" strided transfer support implemented in caf_get and caf_send. When a strided transfer is required, caf_sendget performs the transfer element-by-element. |
|
@afanfa does |
|
@afanfa : I totally understand that caf_sendget does not have an optimized version for strided access, but I claim that caf_sendget does not work at all with strided access. IMO does it not work when the stride is in the get-part nor when it is in the send part. I have attached a testcase, that shows my claim. |
|
@zbeekman: yes, STRIDED must be defined to enable the efficient strided transfer (but in caf_sendget is not implemented). You want to disable it when you are dealing with strided transfers of derived type coarrays. In that case, the library is forced to send the data byte-by-byte. |
This PR is to supersede #237 if that's OK with @vehre and everyone else.