-
-
Notifications
You must be signed in to change notification settings - Fork 232
Added a flag to cache intermediate output #59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added a flag to cache intermediate output #59
Conversation
… if the arguments have not changed
Great! I'll need some time to review. |
I'd like to see the implementation tweaked a little bit: a) --cache-intermediate-output should accept a directory supplying the cache directory b) The cache directory should be used as c) Before running a job, generate the command line identity hash. Check if there is a file by that name that in the cache directory. If present, this file should contain the JSON output object from the cached execution, which you can read and return. d) If not, run the job. If the job completes successfully and the output object is valid, create a file in the cache directory named with the command line identity hash and containing the JSON output object . The function for generating the command line identity hash would probably work a bit better with an approach based on hashing the the contents of the "builder" object, but I'll have to give it more thought. |
Hi Peter, Thanks for taking a look! I have some questions about your requests. a) Certainly, this is something we had talked about doing. b) Reading the code, Something else I'm missing here? c,d) You are suggesting that our output files are cached by virtue of the cache directory being Either way, for the reasons above (b) I don't see how we can reliably use In the current implementation I am just copying whatever appears in the This is the least invasive approach I was able to come up with, in particular preserving the call to The Open to all perspectives otherwise of course! Thanks for taking the time to work through it. |
Can one of the admins verify this patch? |
add to testlist |
Okay, I have resolved all conflicts with master. Please let me know the answers to my questions above or any other concerns you have with this patch. We have already been using this branch successfully in the wild for a couple weeks now and found it critical in our development process for the SMC Dream Challenge. Would be great to have it be a real feature of cwltool, thanks! |
Thanks! I won't be able to look at it until the middle of next week. I agree this is a useful feature so I may branch off your PR to tweak the implementation a bit. |
…into cache-intermediate-output
…into cache-intermediate-output
Hi @prismofeverything @kellrott I used this branch as a basis for an expanded caching implementation, which is PR #71, so I'm closing this PR. |
Hello,
Over here at OHSU we are preparing to launch the SMC RNA Dream Challenge, where contributors will be submitting their entries as CWL workflows. These workflows have many steps, and while developing the submission process we find ourselves having to rerun workflows repeatedly in order to work out the bugs. Most of these steps are quick, but one in particular takes a long time (~8 hours), which has been a huge limitation on our iteration cycle.
This patch adds a flag to cwltool,
--cache-intermediate-output
, which will cache the output of one step once completed and reuse the cached output instead of running the step over if the workflow is run again. For now this is just in acache
subdirectory in the output directory, which makes it easy to delete if you want to run it again.Identity between runs is just based on the command line for that step. As I suspected people might want to base identity on other things (like a hash of the input files possibly) I made the identity function a parameter to
main
in thekwargs
undergenerate_identity
. Though for our purposes it is a reasonable assumption that if someone is using the--cache-intermediate-output
flag they want to cache the output, so just using the command line as identity is sufficient, but maybe in the future someone will want to base identity on something more specific.I also added a simple two step workflow (
revsort-workflow.cwl
) and a one line script to test it (cache-test.sh
) to thetests
directory to verify it works. All feedback welcome, thanks!