Description
openedon Sep 11, 2024
OS: Linux
Python 3.8
dvc 3.55.2
dvc-azure 3.1.0
dvc-data 3.16.5
dvc-http 2.32.0
dvc-objects 5.1.0
dvc-render 1.0.2
dvc-studio-client 0.21.0
dvc-task 0.4.0
Hi,
I come across a special case where the dvc push
fails due to hosting the remote on a network shared drive (Samba Drive).
During pushing the files from the local cache to the remote storage, DVC will reach copyfile
here and attempt to do CoW (Copy on Write) here but due to the fact that both (cache and remote) are on different file systems, it will fail and try to unlink here the linking happened due to calling fcntl.ioctl()
here and then copyfile
will try to copy the file using shutil.copyfile(src, dst)
here. At this point, the call fails due to FileNotFoundError
exception as the dst
is seen as file (os.path.exists(dst)
returns True
) but it is not actually found on the filesystem.
it seems that due to network latency or caching from SMB drive, the unlinking here is not reflected immediately and I've to wait few milliseconds until the dst
path is removed and could be seen as not found on the filesystem (os.path.exists(dst)
returns False
). Then, normal copying using shutil.copyfile(src, dst)
could work fine.
I've written a simple script by extracting the code snippets from lib to reproduce the error.
import os
import shutil
from secrets import token_urlsafe
import fcntl
import time
FICLONE = 0x40049409
def reflink(src, dst):
src_fd = os.open(src, os.O_RDONLY)
try:
dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
except OSError:
os.close(src_fd)
raise
try:
fcntl.ioctl(dst_fd, FICLONE, src_fd)
except OSError as e:
print(">>>", e)
os.close(src_fd)
os.close(dst_fd)
try:
os.unlink(dst)
### Temp fix, comment below code to reproduce the error ###
while os.path.exists(dst): # Check if file still exists
time.sleep(0.1)
print(f"{dst} still exists after 0.1 second")
### Temp fix, comment above code to reproduce the error ###
print(dst, "exists: ", os.path.exists(dst)) # after unlink, the `dst` shouldn't be found
except OSError:
print(">>> failed to unlink")
pass
raise
else:
os.close(src_fd)
os.close(dst_fd)
def tmp_fname(prefix: str = "") -> str:
"""Temporary name for a partial download"""
return f"{prefix}.{token_urlsafe(16)}.tmp"
rpath = "/mnt/interior_sensing/test-dvc/randombinfile.bin" #dest Samba drive
parent = "/mnt/interior_sensing/test-dvc"
lpath = "./randombitsfile.bin" # local filesystem
tmp_file = os.path.join(parent, tmp_fname())
# create random file
fileSizeInBytes = 1024*1024
with open(lpath, 'wb') as fout:
fout.write(os.urandom(fileSizeInBytes))
try:
reflink(lpath, tmp_file)
# NOTE: reflink may or may not clone src permissions
os.chmod(dest, 0o666 & ~umask)
os.replace(tmp_file, rpath)
print("Reflink Successfully !!!")
exit()
except OSError:
pass
# if reflink is succeeded, code will not reach here. Otherwise, do normal copy
shutil.copyfile(lpath, tmp_file)
os.replace(tmp_file, rpath)
Error output:
>>> [Errno 18] Invalid cross-device link
/mnt/interior_sensing/test-dvc/.a4qKiU5jH5cZ124XMHY-dg.tmp exists: True
Traceback (most recent call last):
File "/home/user/env/test_dvc_script.py", line 67, in <module>
shutil.copyfile(lpath, tmp_file)
File "/home/user/env/lib/python3.9/shutil.py", line 266, in copyfile
with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/interior_sensing/test-dvc/.a4qKiU5jH5cZ124XMHY-dg.tmp'
Output after applying the temp fix
>>> [Errno 18] Invalid cross-device link #due to the fact that src, dst are not different filesystems
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp still exists after 0.1 second
/mnt/interior_sensing/test-dvc/.k28cCuZ4Tx5SFpVu-dn9RQ.tmp exists: False
(env) user@machine:~/workspace$ ls /mnt/interior_sensing/test-dvc/
randombinfile.bin # file copied successfully
I'm not sure if this a case that should be handled or it is a case specific to my setup, but I thought it is worth mentioning.