Skip to content

Conversation

@zbeekman
Copy link
Collaborator

This PR is to supersede #237 if that's OK with @vehre and everyone else.

@codecov-io
Copy link

codecov-io commented Nov 11, 2016

Current coverage is 48.18% (diff: 100%)

Merging #250 into master will not change coverage

@@             master       #250   diff @@
==========================================
  Files             2          2          
  Lines           907        907          
  Methods          42         42          
  Messages          0          0          
  Branches        195        195          
==========================================
  Hits            437        437          
  Misses          392        392          
  Partials         78         78          

Sunburst

Powered by Codecov. Last update 611ee79...153efbb

@zbeekman zbeekman force-pushed the wrong-preproc-sym-4-haveint-128 branch from 5194fce to 2f4f505 Compare November 12, 2016 00:51
@zbeekman
Copy link
Collaborator Author

zbeekman commented Nov 12, 2016

@vehre thanks for taking a look!

:shipit:

Approved with PullApprove

@zbeekman
Copy link
Collaborator Author

zbeekman commented Nov 12, 2016

Re-approve after pulling master:

👍

Approved with PullApprove

@zbeekman
Copy link
Collaborator Author

zbeekman commented Nov 12, 2016

LGTM

I confirmed that the only test failure with latest GCC trunk (built today) is still the get_self test (same_loc.f90), more specifically the "get row" portion by compiling the latest GCC trunk and and then MPICH 3.1.4 with GCC. I'll dig into the source of the test failure some more... it could be a problem with the test logic, not sure yet. But I don't see why that needs to block this PR.

$ mpif90 --version
GNU Fortran (GCC) 7.0.0 20161112 (experimental)

Approved with PullApprove

@zbeekman zbeekman merged commit 3e92b21 into master Nov 12, 2016
@zbeekman zbeekman deleted the wrong-preproc-sym-4-haveint-128 branch November 12, 2016 23:35
@rouson
Copy link
Member

rouson commented Nov 14, 2016

@zbeekman the latest virtual machine now has a trunk build from the same date cited above and I too confirm that there is only one test failure, but my recollection is that it's a different than the once cited above. In the virtual machine, the one test that fails is get_with_offset_1d.f90, which runs as test 9. I thought that was the same one you were seeing on OS X. No?

@vehre, I believe my previous problems with building in the virtual machine related to some twisted logic in the environment modules. The problem has mostly been sorted out. Would it be helpful for you to download the latest virtual machine and run our tests with the default compiler in the sourcery student account? At least then you would be able to reproduce the exact scenario that I'm seeing.

@vehre
Copy link
Collaborator

vehre commented Nov 14, 2016

@rouson @zbeekman and I figured, that the testcase # 9 is missing synchronisation. Because everything is parallel in the testcase, the image doing the check, might start doing so before the data has been set by the remote images. So there may still be an issue with the sync all. I will download newest VM and recheck.

@zbeekman
Copy link
Collaborator Author

@vehre After looking at the standard some more and discussing with others, I'm not convinced that the issue is with missing synchronization. Gets, e.g., local_b(:) = remote_a(:)[me+1] should be blocking on the local (me) image. So either the sync all is failing before the get, or there is some deeper issue.

Also, I don't think we need explicit synchronization at the end of the test: shouldn't the runtime ensure the buffers are flushed before terminating the program? I think I saw some comments in caf_mpi.c that may pertain to this---perhaps it's not implemented yet.

@zbeekman
Copy link
Collaborator Author

I'll do some more testing and reading up on the standard to confirm this, but I think what I said is true.

@rouson
Copy link
Member

rouson commented Nov 14, 2016

I just reviewed the test. Although I could read the standard some to confirm, my experience with CAF indicates to me that the test is correct and the failure of the test represents a bug either in gfortran or in OpenCoarrays. To corroborate that, I'll mention that @afanfa has previously communicated to me that the standard requires "gets" are blocking, which really is the only thing that could make sense, given the lack of a mechanism in Fortran 2008 for checking whether the communication has completed.

@zbeekman, you raise a very interesting and subtle point regarding whether the second synchronization is needed. I have generally assumed it is, but you caused me to think more about it and I'm not certain for the following reason:

program termination is a three step process: (1) initiate termination, (2) synchronize, (3) terminate. I wonder if an error stop is able to cause error termination of all images during step (2). My suspicion is yes, in which case the second synchronization is not needed.

I'll be seeing Bill Long of Cray, Inc., later today and hope to have a moment to probe him on this. Cray invented coarrays. Bill was the lead author on the TS that added the new parallel features in Fortran 2015 and I assume he was involved in designing the coarray features of 2008 so he's an authority.

@rouson
Copy link
Member

rouson commented Nov 14, 2016

I'm going to refactor the test to make the code more compact and modern, e.g., using an array statement instead of a loop. My changes should not affect the outcome. At some point soon, I should go through all the tests and make similar changes.

@vehre
Copy link
Collaborator

vehre commented Nov 15, 2016

Now, as I have explained already, is testcase # 9 using an allocatable array on the lhs, namely the b. This mandates the use of the new API. The new API publishes the array descriptors as mpi-windows in this case the "a". Those array descriptors live on the stack of the main program. Don't mix up array descriptors and array data. The array data is allocatable and allocated using ALLOCATE(). The array descriptor is the meta-data giving the rank the extends in each dimension and the datatype of the array content.

Now when an image leaves the main program, its stack is voided and caf_finalize() is called, which overwrites the memory published as mpi-window. This can cause a crash or inconsistent data.

In principle is the calling sequence starting from invocating the program from a shell in every Fortran program like this:

int main(int argc, char **argv) {
gfortran_caf_init (&argc, &argv);
// do some other init steps.
MAIN
_ (); // This is the Fortran main program which creates a new frame on the stack as explained above
_gfortran_caf_finalize ();
return 0;
}

void MAIN__() {
array_descriptor_1 a; // declare the array descriptor of array a on the stack.

a.data = 0B; // Set that the allocatable array is unallocated.
// Do the other code of the main program.
}
In the MAIN__ the array descriptor lives on the stack. As @zbeekman pointed out is the runtime responsible for synchronisation at program termination. It does so, but in a different function which then corrupts the array descriptor published by overwriting the stack with its own data.

I am still thinking of how to fix this. Approaches may be:

  • make array descriptors of coarrays implicitly save,
  • add an synchronisation at the end of the function.

I don't like the latter, because it may add to many synchronisation points. What are your opinions?

  • Andre

@zbeekman
Copy link
Collaborator Author

I am still thinking of how to fix this. Approaches may be:

  1. make array descriptors of coarrays implicitly save,
  2. add an synchronisation at the end of the function.

I don't like the latter, because it may add to many synchronisation points. What are your opinions?

At the end of which function? MAIN__()? (caf_finalize() already calls MPI_barrier(), however I guess the problem is that the array descriptors have already been freed/gone out of scope.) Is my understanding her correct? Are you referring to adding synchronization at the end of MAIN__()?

Adding synchronization when exiting the Fortran "main" seems like it might be an acceptable solution to me, but I defer to @afanfa's expertise. The only obvious case that concerns me is if there is a failed image, or unexpected termination of some sort on one of the images, that will then cause all the other images to hang in, while waiting for a sync all or MPI_barrier etc. that is never going to come. However, it seems like this may already be the case due to MPI_barrier() in caf_finalize, and I know failed images etc. are part of @afanfa's ongoing work. In addition it seems as though there may already be issues flushing IO buffers, so that approach could kill two birds with one stone. However, I do agree that we should be very careful not to introduce synchronization in excess of those explicitly required by the standard or implicitly required to ensure correct Coarray Fortran semantics.

I'm not sure I fully understand the implications of approach # 1: the array descriptors (and data, right?) would stick around until the barrier in caf_finalize is hit. Is there a draw back to this? It seems to me as though this might be preferable due to one less synchronization.

@vehre
Copy link
Collaborator

vehre commented Nov 15, 2016

At the end of which function? MAIN__()? (caf_finalize() already calls MPI_barrier(), however I guess the problem is that the array descriptors have already been freed/gone out of scope.) Is my understanding her correct? Are you referring to adding synchronization at the end of MAIN__()?

Yes, your understanding is correct. In this special case yes to MAIN__(). But in fact would that approach need synchronisation in every function that has coarray stack variables, which makes the approach not feasible.

I did some research what it would need to make the variable static: exactly one statement in the compiler and instantly the issue is gone.

Btw, on my system the testcases changed their numbering: get_with_offset_1d is now # 10. With my newest copy of git:///vehre/coarray only testcase get_self/sameloc.f90 fails.

I'm not sure I fully understand the implications of approach # 1: the array descriptors (and data, right?) would stick around until the barrier in caf_finalize is hit. Is there a draw back to this? It seems to me as though this might be preferable due to one less synchronization.

Right, they would go to an rss-segment in the executable and no longer on the stack. Because gfortran does not have this variable declared as SAVE, the variable will be reinitialized upon reentering a function. Shoot, that's the next issue. Imagine this is done in a loop. Oh no, what a nightmare.

@zbeekman
Copy link
Collaborator Author

zbeekman commented Nov 15, 2016

Btw, on my system the testcases changed their numbering: get_with_offset_1d is now # 10. With my newest copy of git:///vehre/coarray only testcase get_self/sameloc.f90 fails.

yes, sorry for the confusion surrounding the numbering, I re-enabled some more tests in 76c4492 I've got a patch for building GCC from trunk on OS X now, so I'll take another look at the test failures. I'm a little bit surprised the issue you mentioned with array descriptors on the stack getting corrupted by caf_finalize has not shown up sooner... I'm going to try to find some time to scrutinize the tests more carefully.

@vehre
Copy link
Collaborator

vehre commented Nov 15, 2016

I'm a little bit surprised the issue you mentioned with array descriptors on the stack getting corrupted by caf_finalize has not shown up sooner.

It couldn't. Only with introducing the new API (the caf*_by_ref()-functions) those descriptors need to be published. The old API did not need the descriptor present any time, because those needed to in-sync on all images, so publishing them was not necessary. And with allowing reallocation of the destination buffer by the API-function this information needs to be present (and changed).

@vehre
Copy link
Collaborator

vehre commented Nov 16, 2016

I'm a little bit surprised the issue you mentioned with array descriptors on the stack getting corrupted by caf_finalize has not shown up sooner...

Those are needed only by the new API of caf*_by_ref() functions. Because this API allows reallocation of result arrays the descriptor needs to be fixed accordingly, which of course needs the descriptor to be available. Furthermore does the new API need the descriptor for reffing into the arrays. The old API implemented reffing only for arrays, that are same sized on all images. Therefore supplying the image's local descriptor to the communication-routine was sufficient. But with the support for allocatable components, each component can have a different size on the different images, which enforces the need for the array descriptor to be accessible. While designing the new API we tried to be as universal as possible and therefore the array descriptor is needed whenever a coarray is registered, even when the descriptor is never used.

@vehre
Copy link
Collaborator

vehre commented Nov 16, 2016

I have analyzed the issue in sameloc further. With introducing the new API sendget(_by_ref) is used, whenever a coarray is present on the left- and right-hand side of an assignment. Debugging into _caf_sendget() I do not see the stride respected properly. I therefore deem the bug in _caf_sendget(). I tried to fix this by branching to _caf_get() when the image_index_s is this_image. This helped for cases where only get() was involved, unfortunately is also a send involved.

@afanfa
Copy link
Contributor

afanfa commented Nov 16, 2016

FYI: caf_send does not support the "advanced" strided transfer support implemented in caf_get and caf_send. When a strided transfer is required, caf_sendget performs the transfer element-by-element.

@zbeekman
Copy link
Collaborator Author

@afanfa does STRIDED need to be defined to enable the efficient strided transfers? If so is there a reason to ever disable it?

@vehre
Copy link
Collaborator

vehre commented Nov 17, 2016

@afanfa : I totally understand that caf_sendget does not have an optimized version for strided access, but I claim that caf_sendget does not work at all with strided access. IMO does it not work when the stride is in the get-part nor when it is in the send part.

I have attached a testcase, that shows my claim.
stridedsendget.f90.gz
I had to gzip it or github won't let me attach it.

@afanfa
Copy link
Contributor

afanfa commented Nov 17, 2016

@zbeekman: yes, STRIDED must be defined to enable the efficient strided transfer (but in caf_sendget is not implemented). You want to disable it when you are dealing with strided transfers of derived type coarrays. In that case, the library is forced to send the data byte-by-byte.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants