rmtreesafe seems to fail when there's an NFS silly rename #551
Comments
Hi, we've seen this on some occasions and it does not have an easy fix. Many cases happen when many files are deleted (this occurs when restarting a protocol, where all existing files are removed). If any of those files are in use by other protocols, Scipion should not allow you to restart. But if the user has opened, for example, Chimera with a volume that is under the protocol folder you are trying to restart, there is a conflict. I'm not sure whether NFS is aware of this case and refuses a "full wipe-out" of the protocol folder. In other cases I've seen NFS simply not allowing the deletion of files when the deletion is massive.

I'm not an expert in NFS, but I think we can't do much about this. We use it daily at the lab and have no issues, although I'm not sure about the version and hardware. Is this something NFSv4 will not suffer from? Maybe we can do something about it? Are you able to tell which original file was the one that ended up with the silly name? I'll be happy to trace this case down and see if there is something we can do, but we need a reproducible scenario, which we don't have.
Hi @pconesa,
This is a fairly common scenario with NFS. It can be reproduced in bash/python with the following commands:
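A minimal sketch of such a reproduction, assuming the working directory sits on an NFSv3 mount (the path below is made up for illustration):

```bash
# Assumes /path/to/nfs/mount is an NFSv3 mount (made-up path).
mkdir -p /path/to/nfs/mount/demo
echo "data" > /path/to/nfs/mount/demo/file.txt

# Keep the file open in a background process, as a viewer or another job would.
tail -f /path/to/nfs/mount/demo/file.txt &
HOLDER_PID=$!

# Unlinking an open file makes the NFS client "silly rename" it instead of removing it.
rm /path/to/nfs/mount/demo/file.txt
ls -a /path/to/nfs/mount/demo      # shows a .nfsXXXXXXXX entry

# Recursive removal now fails because the silly-renamed file is still held open.
rm -rf /path/to/nfs/mount/demo     # "Device or resource busy"

# Once the holder exits, the .nfs* file disappears and the directory can be removed.
kill "$HOLDER_PID"
```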
NFSv4 is stateful and uses file locking/delegations instead (see RFC-7530), so in theory it shouldn't need to produce silly renames (NFSv4.1 seems to get rid of them in RFC-8881). However, if one server has a lease for a file and another server attempts to delete that file, the deletion should still fail and it seems that at least some implementations of NFS v4 still produce silly renames. The problem that silly renames solves (how to ensure that a file that multiple clients are accessing adheres to Unix filesystem semantics) should be fairly universal for POSIX-compliant network filesystems, so I don't think this is avoidable regardless of whether we're talking about NFS or another distributed filesystem (e.g., Lustre uses a distributed lock manager to deal with the same issue).
On the host that still has the file open, we can see which process led to the silly rename by running a tool such as lsof or fuser against the silly-renamed file.
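For instance, sticking with the made-up mount path from the sketch above, either of these reports the PID(s) still holding the silly-renamed file open:

```bash
# List local processes that still have the silly-renamed file open.
lsof /path/to/nfs/mount/demo/.nfs*
# fuser reports the same PIDs in a more compact form.
fuser -v /path/to/nfs/mount/demo/.nfs*
```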
I am unaware of a way to show what path a silly renamed file was originally associated with.
The solution I would propose is to keep track of which batch jobs (in a cluster environment) might be using the directory/files in question, and to cancel these jobs before attempting to remove the directory. If this error occurs outside of a cluster environment, the solution would instead be to determine which processes on the client are keeping the file open and kill them, rather than trying to run an unlink()/delete against the silly rename.
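As a rough sketch of that cleanup on a SLURM cluster (the job name, directory variable, and paths below are hypothetical and not Scipion's actual bookkeeping):

```bash
# Hypothetical pre-cleanup before removing a protocol folder on shared NFS storage.
PROTOCOL_DIR=/path/to/nfs/mount/Runs/000123_ProtocolXXX   # made-up path

# 1. Cancel the user's batch jobs that may still be writing under the folder
#    (filtered here by a made-up job name; a real integration would track job IDs).
scancel --user="$USER" --name="scipion_000123"

# 2. Kill any local processes that still hold silly-renamed files open.
fuser -k "$PROTOCOL_DIR"/.nfs* 2>/dev/null

# 3. The recursive removal should now succeed.
rm -rf "$PROTOCOL_DIR"
```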
Thanks @jpellman for the detailed explanation. We have reviewed this case several times and ended up assuming it is something external we can't deal with... but maybe we can. Maybe we can tolerate this case (when deleting a bunch of files at protocol restart) by allowing for a status we haven't considered before. In case it is even Scipion itself that is blocking a file, is there a chance to know which command corresponded to that PID?

One important question: is it reproducible? That would give us a hint as to whether we are actually provoking this case under certain steps. One obvious case we can think of is: 1.- The user runs protocol XXX (this produces its own folder with all the outputs).
One of our researchers has gotten the following error:
It looks like the issue is that a file is open on multiple servers at the same time, the file is deleted, and then pyworkflow tries to delete the silly name associated with the deleted file but can't since the file is still open elsewhere.
This issue occurred specifically when using Scipion on a SLURM cluster (i.e., multiple jobs were accessing the same storage in parallel) over stateless NFSv3.