- 
                Notifications
    You must be signed in to change notification settings 
- Fork 929
ompi/group: fix race condition in group operations #8547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1d92c7c    to
    9dff148      
    Compare
  
    | bot:aws:retest | 
        
          
                ompi/group/group.h
              
                Outdated
          
        
      | * to NULL. | ||
| */ | ||
| static inline struct ompi_proc_t *ompi_group_dense_lookup (ompi_group_t *group, const int peer_id, const bool allocate) | ||
| static inline struct ompi_proc_t *ompi_group_dense_lookup (ompi_group_t *group, const int peer_id, unsigned char flags) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can still be a const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the passed value is not a pointer, I guess const can safely be ignored,
        
          
                ompi/group/group.h
              
                Outdated
          
        
      | * to reach the process pointer | ||
| */ | ||
| static inline ompi_proc_t *ompi_group_get_proc_ptr (ompi_group_t *group, int rank, const bool allocate) | ||
| static inline ompi_proc_t *ompi_group_get_proc_ptr (ompi_group_t *group, int rank, unsigned char flags) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here - wonder if the const can still be kept.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it matters, these arguments are const by default as they cannot be returned to the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - just a few questions.
| We threaded very carefully to avoid global locks, a design at the opposite to what this PR does. Moreover, creating and adding a new proc_t (from the sentinel), is an expensive operation (calling into other libraries), and always taking the global lock when manipulating the group will prevent any other proc_t operations from succeeding (even on totally unrelated processes). I understand that due to the use of the sentinel, we need to be careful how we create and add the new proc_t to the group, but at the end we only need to protect the update operations in the hash table while allowing all reads to go unhindered. | 
92ac56b    to
    c98f4a6      
    Compare
  
    | 
 Thanks for the review George, I guess having a lock only during the updates in the proc hash-table might not be sufficient here, because during an MPI group operation (eg,  The snippet pasted below does the comparison of two groups  https://github.com/open-mpi/ompi/blob/master/ompi/group/group_plist.c#L43 Here is a snippet from the test case used to identify the issue, When the above test  is launched with 80 procs, even when both the groups passed to  And similar group comparison code block is present in multiple group operations which will yield the same inconsistent result. If a lock is taken just while updating the proc pointers in the proc hash-table the issue will still persist. Please let me know your suggestions. Thanks. | 
| I saw your issue only after reviewing the code, I would have had a slightly different review based on the code you provided. The real issue with the code is that we are resolving the ompi_proc_t when we should not, because at the end the only information we need is the process name. And we have this in the sentinel, without any need to lock global structures. Thus, we could simply change the way we compare groups, and this would make your example work. I guess we can build a case where multiple threads need to resolve the same ompi_proc_t from two different groups/communicators at the same moment, which will bring us back to the need for some locking mechanism, but I need to revisit the PML code again in order to validate this assumption. | 
| @bosilca Thanks for the suggestion, yes just by changing the way we do the comparison of group by using the process name as here AboorvaDevarajan@3399031 instead of process pointers seems to be fixing this specific issue. I will do some review and push the updates. And also as you pointed out, will try to see if there will be any other race conditions during group allocation. | 
| I looked at the new version and I think the code is cleaner. There might be an impact on the performance, because of the calls to opal_compare_proc, but that could be the topic for another time. | 
c98f4a6    to
    faadba0      
    Compare
  
    | @bosilca I have pushed the changes, since we are just concerned about the equality while comparing the processes here, I'm just checking if it is  | 
faadba0    to
    7e73b07      
    Compare
  
    | Sorry, I did not expected you to do this last change, because I don't think we should do it. Long story short, the naming is something that we are supposed to inherit from the RTE, and we should always manipulate it via the accessors provided. I let the other chime in, but I think everything is good in this PR, with the exception of the removal of the use of opal_compare_proc. | 
To avoid checking sentinel process pointers to the original `ompi_proc_t` pointers compare the processes in the groups using process names. Signed-off-by: Aboorva Devarajan <abodevar@in.ibm.com>
7e73b07    to
    0f2c70c      
    Compare
  
    | sure George, Thanks. reverted back. I guess comparison function can be separately optimized. | 
| bot:aws:retest | 
| bot:aws:retest let's try again... | 
| @AboorvaDevarajan can you please cherry-pick this to v5.0.x? | 
ompi/group: fix proc pointer comparison in groups
To avoid checking sentinel process pointers to the original
ompi_proc_tpointers compare the processes in the groups using process names.Signed-off-by: Aboorva Devarajan abodevar@in.ibm.com