clear spec cache to avoid sporadic mismatch #431
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage diff (master → #431):
- Coverage: 91.99% → 92.00%
- Files: 75 → 75
- Lines: 13499 → 13504 (+5)
- Hits: 12419 → 12424 (+5)
- Misses: 1080 → 1080
I don't see how this is a race condition. We don't have multiple cores modifying the same area of memory here, do we? Also, why isn't the cache self._array being cleared between Python runs? I would have thought it would be empty at the start of a new run. If the Fortran arrays in the cache aren't being freed, won't this lead to a memory leak?
@andrewgiuliani you are right, this is not a race condition, excuse my ignorance. F90wrap works in a funny way (I don't completely understand it), but the cache is only cleared when the Python kernel quits. It sets up links to shared FORTRAN memory that persist until the kernel exits. Normally you only run a single simulation, and the kernel quits when the optimisation has finished; it is only in testing that many, many SPEC instances are created and destroyed. For example, in code like the sketch below, the Fortran arrays in myspec1 and myspec2 are the same. This is a limitation of f90wrap, and such code should not be used. The best we can do is make sure that loading a new state runs correctly. Yes, this is a memory leak when many Spec objects are initialized in the same kernel session, but I don't see how to avoid it.
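A minimal sketch of the usage pattern meant here: the import path and input file names are placeholders, and only the Spec class and the myspec1/myspec2 names come from this discussion.

```python
# Hypothetical sketch -- import path and input files are assumptions.
from spec import Spec  # assumed import; the real module path may differ

myspec1 = Spec("case1.sp")  # first wrapped SPEC instance
myspec2 = Spec("case2.sp")  # second instance created in the same Python kernel

# f90wrap links both instances to the same shared FORTRAN module memory, so
# the Fortran arrays reached through myspec1 and myspec2 are the same storage;
# modifying one is visible through the other.
```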
When you clear the cache, is it possible to run a call on the FORTRAN side to deallocate the allocated arrays referenced in the cache? If not, I would suggest that this anomalous behaviour should be mentioned in a docstring somewhere.
see above for comments
@andrewgiuliani This is not possible with the current implementation of f90wrap, and I am not sure it is desirable. f90wrap is meant to wrap existing programs, giving you access to the memory as they run, with the possibility of mutating existing arrays; forcing deallocation on the FORTRAN side from the Python side would alter the program it is wrapping. Python keeps a dictionary of arrays that have been referenced, but if a new array has the same handle, the dictionary is not updated, and referencing it outside the Python-defined bounds (not the FORTRAN bounds) results in the error we see. It is certainly a bug in the implementation that a handle collision (that is the right term!) is likely to occur if arrays are deallocated and immediately afterwards allocated on memory-constrained systems, but this is at the core of some design choices made for f90wrap, and not something I feel comfortable fixing. The current PR fixes the anomalous behavior that was occurring (the cache kept accumulating), and this fix, though a bit draconian, prevents it (and since the collision could occur for every accessed array, it is the right thing to do). I will probably open an issue on the upstream f90wrap, but this will take a while, as the whole numpy 2.0 migration has left me very little time for science, and I have some QUASR configurations to calculate tangles in! ;)

edit: actually, this just overwrites the existing dictionary with an empty dictionary. I assume the Python garbage collector will delete it as there are no more references to it. Could someone with more Python knowledge comment on whether this is the right thing to do (@mbkumar)?
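For reference, a minimal pure-Python sketch of what overwriting the cache dictionary amounts to; the class and attribute names are illustrative stand-ins, not f90wrap's actual internals.

```python
class WrappedModule:
    """Stand-in for an f90wrap-generated wrapper object (illustrative only)."""

    def __init__(self):
        # Cache mapping an integer handle (the FORTRAN array's address cast
        # to an int) to the Python-side view created on first access.
        self._array_cache = {}

    def clear_cache(self):
        # Rebinding the attribute to a fresh dict drops every handle -> array
        # reference.  With no references left, CPython's reference counting
        # (helped by the cyclic garbage collector for reference cycles)
        # reclaims the old dictionary; no explicit gc call is needed.
        self._array_cache = {}


mod = WrappedModule()
mod._array_cache[140234511232] = "cached array view"  # pretend handle entry
mod.clear_cache()
assert mod._array_cache == {}  # stale entries can no longer be returned
```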
I would have thought that a deallocate-all function could be written on the Fortran side and called from Python. In any case, I think it is great that the bug has been identified!
@smiet, can we have a Zoom call to understand what is going on? I am not getting the full picture from your description. Can't we call the Python garbage collector on the dictionary you are resetting?
At the least, can you talk to the f90wrap developer, who might have encountered this issue before? There should be some mechanism in f90wrap to clear the cache.
@mbkumar yes, let's zoom tomorrow (Monday); I was away from the internet this weekend. Can you set it up? I'm having a little trouble with email on my phone right now, but I'm available all your morning.
I'd like to be present during the zoom too, to better understand the problem.
Some notes from the discussion resulted in issue #434. All tests are passing consistently; this PR doesn't solve the root cause, but it applies a fix that stops a known bug. I suggest we move further discussion of this topic to #434 and merge this fix so we can stop seeing sporadic failures in our CI. @mbkumar @landreman @andrewgiuliani
Thanks for figuring out a fix for this tricky bug!
Edited for improved explanation
This should fix the sporadic failure in CI (#357).
The issue: f90wrap keeps a cache on the Python side for the wrapped FORTRAN arrays, and the logic is roughly as follows: each array that has been referenced is stored in a dictionary keyed by its handle, and on later accesses the cached Python-side view (with its stored shape and size) is reused instead of re-reading the FORTRAN array; the dictionary is never updated when the FORTRAN side deallocates or reallocates.
Now for the handle collision: if SPEC has run before, and the array has been accessed before (put in the cache), and by happenstance an array in the second run is placed in the same memory location as before, a handle collision occurs, and Python assumes the array is identical to the one it accessed before (in shape and size too).
When Python then tries to access this array, it can actually be a different array that just happens to be placed in the same location, or it is the right array but the shape on Python's side has not been updated; either way we get an out-of-bounds error when trying to read outside the Python-defined bounds.
The handle is basically the pointer to the array, cast to an int. It is unlikely to be exactly the same EXCEPT when the code runs over and over again and a conveniently shaped hole has just been emptied in memory by a deallocation.
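The failure mode can be sketched in pure Python; the handle-keyed cache below is an illustrative stand-in, not f90wrap's actual data structure.

```python
import numpy as np

# Illustrative cache: integer handle -> Python-side view of a FORTRAN array.
cache = {}

def get_array(handle, build_view):
    # The view is built only on first access; later lookups trust the cache
    # and never check whether FORTRAN has reallocated something else there.
    if handle not in cache:
        cache[handle] = build_view()
    return cache[handle]

# First SPEC run: a 5-element array lives at (pretend) address 0x7f00.
first = get_array(0x7f00, lambda: np.zeros(5))

# Second run: the old array is deallocated and a new, larger array happens to
# land at the same address, so it produces the same integer handle.
second = get_array(0x7f00, lambda: np.zeros(10))

# The stale 5-element view is returned instead of the new 10-element array,
# so reading element 7 raises an IndexError against the Python-defined bounds
# even though the FORTRAN array really has 10 elements.
assert second.shape == (5,)
```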
This also explains why the fault is mostly seen in CI: the runners have much less memory and are therefore more likely to reuse the same location. But this placement is handled by the memory allocator, not by us, and the failure is not reproducible.