-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CalcJob
: allow directories in local_copy_list
#5115
CalcJob
: allow directories in local_copy_list
#5115
Conversation
@ramirezfranciscof @mbercx want to give this a look? Note that this includes another commit that is being added separately in PR #5114 but the second commit depends on it. |
Codecov Report
@@ Coverage Diff @@
## develop #5115 +/- ##
===========================================
+ Coverage 80.89% 80.89% +0.01%
===========================================
Files 536 536
Lines 37057 37066 +9
===========================================
+ Hits 29972 29981 +9
Misses 7085 7085
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also good for me! Just two comments to discuss related to the interaction of this PR with the one it is based on.
if data_node.get_object(filename_source).file_type == FileType.DIRECTORY: | ||
# If the source object is a directory, we copy its entire contents | ||
data_node.copy_tree(filepath_target, filename_source) | ||
provenance_exclude_list.extend(data_node.list_object_names(filename_source)) | ||
else: | ||
# Otherwise, simply copy the file | ||
with folder.open(target, 'wb') as handle: | ||
with data_node.open(filename, 'rb') as source: | ||
shutil.copyfileobj(source, handle) | ||
|
||
provenance_exclude_list.append(target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this conditional? I think data_node.copy_tree
should work when it is provided with the path to a single file, no? (actually it might be good to add a test for this specifically in #5114 )
Same question for data_node.list_object_names
; even if it currently doesn't, maybe it would be easier to make that work with paths to files instead of adding complexity here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I don't copy_tree
should accept a file for a path
. I will add an explicit check in the other PR and add a test. Same goes for list_object_names
, it currently excepts when you pass a path that references a file, as it should. What should the expected output be in your opinion when passing a file? Seems weird to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm ok, I see. I guess I understand how "listing" is technically different if interpreted as "show the content of" and that doesn't work with files, but "copy" is more general and works with both. Is it the "tree" qualifier that for you implies that what is copied should be a directory and files should be identified and rejected? Because I see no practical complications for copying files and folders indistinctively...
Also, related to this: what would be the problem of doing provenance_exclude_list.append(target)
when target
is a folder? I mean, when you do data_node.list_object_names(filename_source))
, the list you get back also includes subfolders, right? So you are already adding folders to the exclude_list
.
On the other hand, I guess one problem is that if you add a subfolder like that and then create some input file that should be included in the provenance inside of that subfolder, then it will be ignored because the folder is on the exclude list, right? But we are anyways adding subfolders because of previous paragraph, so this problem is more complex and not solved by this if
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm ok, I see. I guess I understand how "listing" is technically different if interpreted as "show the content of" and that doesn't work with files, but "copy" is more general and works with both. Is it the "tree" qualifier that for you implies that what is copied should be a directory and files should be identified and rejected? Because I see no practical complications for copying files and folders indistinctively...
With the name copy_tree
, I think it should definitely not accept files. The tree
indeed signifies that it operates on a directory, like put_object_from_tree
. For files there is put_object_from_file(like)
. It can be argued that there should be a generic method that copies both, but I don't think it necessarily makes things easier to use or implement.
Also, related to this: what would be the problem of doing
provenance_exclude_list.append(target)
whentarget
is a folder? I mean, when you dodata_node.list_object_names(filename_source))
, the list you get back also includes subfolders, right? So you are already adding folders to theexclude_list
.
You are correct. Since I expand the filenames in their separate subpaths when they are copied over to the node's repository, this will also include the subpath that equals target
. So there is indeed no need to expand the target path already here in its subfolders. I will change this.
On the other hand, I guess one problem is that if you add a subfolder like that and then create some input file that should be included in the provenance inside of that subfolder, then it will be ignored because the folder is on the exclude list, right? But we are anyways adding subfolders because of previous paragraph, so this problem is more complex and not solved by this
if
.
This is a good point. I guess this might indeed be problematic, but I don't see a clean solution right now. If the file really needs to be stored in the node's repo, then they should simply write the entire folder directly to the sandbox and not use the local_copy_list
. Anyway, this entire system of the local_copy_list
is not an ideal solution but it is there for historical reasons. Really what we should have done from the beginning is just have the plugin write everything to the sandbox and then specify just the provenance_from_exclude
list. This gives full flexibility in terms of what gets stored and what not and it is a lot simpler to understand. This is definitely a thing we should consider simplifying for v3.0 if that ever comes out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, thinking about it, we cannot simplify this by just adding target
to the provenance_exclude_list
. The reason is that if the target is .
, i.e., add all contents to the root, we don't want to exclude everything that is in there. Just those resources that were written by that operation. So what I had originally, expanding the source path into its contents, is the correct thing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, I see. I still feel a bit uneasy about this, but I think this is a general problem of implementing file copy interfaces. It kind of reminds me of when I tried to add directory creation to the copy lists and we had to roll it back because it changed the behavior of how some copy commands were indicated, prompting the publication of issue #4566 out of my frustration. What a headache =/...
Anyways, I still think it would be good to do a more thorough design of this and define a specific interface, but that is beyond the scope of this PR, so I'm just going to approve this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the interface is shit, but the real solution is to get rid of local_copy_list
entirely. The remote_copy_list
is slightly different, but ideally that could also be replaced with an interface that allows a user to directly manipulate the remote file system instead of giving abstract indirect instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, I get the feeling that you will get this kind of annoying headaches with almost any sort of general "file copying" interface you end up implementing. I mean, just from writing bash scripts with cp
or rsync
I remember having problems with suddenly unexpected behavior related to the exact syntax used in the command and the specifics of the situation. I guess that is the reason why shutil
(or you in the related PR) choose to do copy_TREE
kind of methods that are more restrictive (instead of what I wanted to just copy whatever you provide) but at least the expected result is more clear.
The
remote_copy_list
is slightly different, but ideally that could also be replaced with an interface that allows a user to directly manipulate the remote file system instead of giving abstract indirect instructions.
I don't understand, what do you mean by directly manipulating the remote file system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, what do you mean by directly manipulating the remote file system?
By letting the user call direct methods to write/read files instead of using indirect instructions through remote_copy_list
. It would automatically get a transport, which can then be used to perform the operations directly. This would ideally have an interface that is identical to writing files to the local file system, making it transparent whether you are reading/writing from/to a local or remote file system.
d0a75a7
to
bd1e705
Compare
Up till now, the `local_copy_list` could only be used to specify individual files that should be copied. However, very often, one may have an input node from whose repository an entire directory, or sub directory should be copied. The only way to do this was to manually iterate over the contents of that repository directory and add the files to the `local_copy_list` one by one. Here we extend the syntax of the `local_copy_list` and the second argument can now also point to a directory in the repository of the source node. Its entire contents will be copied to the relative path defined by the third element in the tuple. Note that the directory itself won't be copied, just its contents.
bd1e705
to
5dafa4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good to go, thanks!
Fixes #3869
Up till now, the
local_copy_list
could only be used to specifyindividual files that should be copied. However, very often, one may
have an input node from whose repository an entire directory, or sub
directory should be copied. The only way to do this was to manually
iterate over the contents of that repository directory and add the files
to the
local_copy_list
one by one.Here we extend the syntax of the
local_copy_list
and the secondargument can now also point to a directory in the repository of the
source node. Its entire contents will be copied to the relative path
defined by the third element in the tuple. Note that the directory
itself won't be copied, just its contents.