
rmtreesafe seems to fail when there's an NFS silly rename #551

Open

jpellman opened this issue Aug 29, 2024 · 3 comments

@jpellman
One of our researchers has gotten the following error:

Traceback (most recent call last):
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/site-packages/pyworkflow/gui/form.py", line 2186, in _close
message = self.callback(self.protocol, onlySave, doSchedule)
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/site-packages/pyworkflow/gui/project/viewprotocols.py", line 1502, in _executeSaveProtocol
self.project.launchProtocol(prot)
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/site-packages/pyworkflow/project/project.py", line 604, in launchProtocol
protocol.makePathsAndClean() # Create working dir if necessary
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/site-packages/pyworkflow/protocol/protocol.py", line 1432, in makePathsAndClean
self.cleanWorkingDir()
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/site-packages/pyworkflow/protocol/protocol.py", line 1444, in cleanWorkingDir
pwutils.cleanPath(self._getPath())
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/site-packages/pyworkflow/utils/path.py", line 148, in cleanPath
shutil.rmtree(p)
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/shutil.py", line 718, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/shutil.py", line 655, in _rmtree_safe_fd
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/shutil.py", line 675, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/opt/scipion/conda/envs/scipion3/lib/python3.8/shutil.py", line 673, in _rmtree_safe_fd
os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000ec4bacd60002f975'

It looks like the issue is that a file is open on multiple servers at the same time, the file is deleted on one of them, and pyworkflow then tries to remove the silly-rename placeholder left behind for the deleted file but can't, since the file is still open elsewhere.

This issue occurred specifically when using Scipion on a SLURM cluster (i.e., multiple jobs were accessing the same storage in parallel) using stateless NFSv3.
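For reference, the failing call at the bottom of that traceback boils down to an os.unlink() on the silly-rename placeholder, which the NFS client rejects with EBUSY (errno 16) for as long as some client still holds the original file open. A minimal sketch of detecting that specific failure (the placeholder name is just the one from the traceback, reused for illustration):

import errno
import os

# Hypothetical silly-rename placeholder, named after the one in the
# traceback above; on a real system the suffix will differ.
silly_name = '.nfs00000000ec4bacd60002f975'

try:
    os.unlink(silly_name)
except OSError as e:
    if e.errno == errno.EBUSY:
        # Errno 16 ("Device or resource busy"): the original file is still
        # open on some client, so the NFS client will not let us remove
        # the placeholder yet.
        print(f'{silly_name} is still held open somewhere; leaving it')
    else:
        raise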

@pconesa
Contributor

pconesa commented Aug 30, 2024

Hi, we've seen this on some occasions and it does not have an easy fix.

Many cases happen when a large number of files are deleted (this occurs when restarting a protocol, since all existing files are removed).

If any of those files are in use by other protocols, Scipion should not allow you to restart. But if the user has opened, for example, Chimera with a volume that lives under the protocol folder you are trying to restart, there is a conflict. I'm not sure whether NFS is aware of this case and will refuse a "full wipe-out" of the protocol folder.

In other cases, I've seen NFS simply refuse to delete files when the deletion is massive. I'm not an expert in NFS, but I think there is not much we can do about this.

We use NFS daily at the lab and have no issues, although I'm not sure about the version and hardware.

Is this something NFS 4 will not suffer from?

Maybe we can do something about this? Are you able to tell which original file was the one that ended up with the silly name?

I'll be happy to trace this case down and see if there is something we can do about it, but we need a reproducible scenario, which we don't have.

@jpellman
Author

Hi @pconesa,

but we need a reproducible scenario, which we don't have.

This is a fairly common scenario with NFS. It can be reproduced in bash/python with the following commands:

# On one server
jpellman@server1:~$ mkdir test
jpellman@server1:~$ cd test/
jpellman@server1:~/test$ python
Python 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f=open('file1','w')  

# On another server
jpellman@server2:~/test$ rm file1
jpellman@server2:~/test$ ls -la
total 204
drwxr-xr-x   2 jpellman nramm  4096 Aug 30 10:54 .
drwxr-xr-x 127 jpellman nramm 36864 Aug 30 10:53 ..
-rw-r--r--   1 jpellman nramm     0 Aug 30 10:53 .nfs00000000fc272e4c00034ad3

# Back on server1
>>> f.close()

# Back on server2
jpellman@server2:~/test$ ls -la
total 204
drwxr-xr-x   2 jpellman nramm  4096 Aug 30 10:54 .
drwxr-xr-x 127 jpellman nramm 36864 Aug 30 10:53 ..

Is this something NFS 4 will not suffer from?

NFSv4 is stateful and uses file locking/delegations instead (see RFC-7530), so in theory it shouldn't need to produce silly renames (NFSv4.1 seems to get rid of them in RFC-8881). However, if one server has a lease for a file and another server attempts to delete that file, the deletion should still fail, and it seems that at least some implementations of NFSv4 still produce silly renames. The problem that silly renames solve (how to ensure that a file that multiple clients are accessing adheres to Unix filesystem semantics) should be fairly universal for POSIX-compliant network filesystems, so I don't think this is avoidable regardless of whether we're talking about NFS or another distributed filesystem (e.g., Lustre uses a distributed lock manager to deal with the same issue).

Are you able to tell which original file was the one that ended up with the silly name?

On the host that still has the file open, we can see which process led to the silly rename by running lsof:

jpellman@server1:~/test$ lsof .nfs00000000d5d4a9d600034ae1
COMMAND     PID     USER   FD   TYPE DEVICE SIZE/OFF       NODE NAME
python  1732644 jpellman    3w   REG   0,60        0 3587484118 .nfs00000000d5d4a9d600034ae1

I am unaware of a way to show which path a silly-renamed file was originally associated with.
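For completeness, the same lookup can be done from Python by walking /proc/<pid>/fd, which is essentially what lsof does on Linux; like lsof, this only sees processes on the current host. A rough sketch, not anything Scipion currently ships:

import os

def find_local_holders(target):
    """Return PIDs on this host with an open descriptor pointing at target.

    Walks /proc/<pid>/fd, which is roughly what lsof does on Linux; needs
    permission to read other processes' fd tables (same user or root).
    """
    target = os.path.realpath(target)
    holders = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        fd_dir = os.path.join('/proc', pid, 'fd')
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                if os.path.realpath(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(pid))
                    break
            except OSError:
                continue
    return holders

# e.g. find_local_holders('.nfs00000000d5d4a9d600034ae1') -> [1732644]
# /proc/<pid>/cmdline would then give the full command line for each holder.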

Maybe we can do something about this?

The solution I would propose would be to keep track of which batch jobs (in a cluster environment) might be using the directory/files in question, and then to cancel these jobs before attempting to remove the directory. If this error occurs outside of a cluster environment, then the solution would be to determine which processes on the client are keeping the file open and kill them, instead of trying to run an unlink()/delete against the silly rename.
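As a sketch of the non-cluster half of that idea, a check like the following could sit in front of the recursive delete. The helper name clean_path_if_unused is hypothetical, it relies on lsof's -t (print bare PIDs) and +D (recurse into a directory) options, and, like any lsof-based check, it can only see processes on the same host, not on other NFS clients:

import shutil
import subprocess

def clean_path_if_unused(path):
    """Recursively delete path, but refuse if any local process still has
    a file open underneath it (checked with `lsof -t +D path`)."""
    result = subprocess.run(['lsof', '-t', '+D', path],
                            capture_output=True, text=True)
    pids = sorted(set(result.stdout.split()))
    if pids:
        # Here one could instead warn the user, look the PIDs up, or
        # (in the cluster scenario) cancel the owning batch jobs.
        raise RuntimeError('Refusing to delete %s: still open by PIDs %s'
                           % (path, ', '.join(pids)))
    shutil.rmtree(path)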

@pconesa
Contributor

pconesa commented Sep 2, 2024

Thanks @jpellman for the detailed explanation. We have reviewed this case several times and ended up assuming it is something external we can't deal with... but maybe we can. Maybe we can tolerate this case (when deleting a bunch of files at protocol restart), allowing for a status we haven't considered before.
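One way to tolerate it, assuming we are willing to leave the silly-rename placeholders behind for the NFS client to reap once they are finally closed, would be an onerror handler for shutil.rmtree along these lines. This is only a sketch of the idea, not something pyworkflow does today, and it keys off the .nfs name prefix and the EBUSY errno seen in the traceback above:

import errno
import os
import shutil

def _tolerate_nfs_silly_renames(func, path, exc_info):
    """onerror handler for shutil.rmtree: skip NFS silly-rename
    placeholders that are still open elsewhere, re-raise anything else."""
    exc = exc_info[1]
    name = os.path.basename(path)
    if (func in (os.unlink, os.remove)
            and isinstance(exc, OSError)
            and exc.errno == errno.EBUSY
            and name.startswith('.nfs')):
        # Still open on some client; the NFS client deletes the
        # placeholder itself once the last descriptor is closed.
        return
    if (func is os.rmdir
            and isinstance(exc, OSError)
            and exc.errno == errno.ENOTEMPTY
            and all(n.startswith('.nfs') for n in os.listdir(path))):
        # The directory only failed to go away because of placeholders
        # we skipped above; leave it behind as well.
        return
    raise exc

# shutil.rmtree(protocolWorkingDir, onerror=_tolerate_nfs_silly_renames)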

In case it is Scipion itself that is blocking the file... is there a chance to know which command corresponded to that PID?

One important question: is it reproducible? That would give us a hint as to whether we are actually provoking this case with certain steps.

One obvious case we can think of is:

1.- The user runs protocol XXX (this produces its own folder with all the outputs).
2.- The user visualizes the output (the images produced) using ChimeraX (which could be launched from Scipion or not).
3.- The user does not like the output and decides to re-run protocol XXX while ChimeraX is still open. I believe this will create those files.
