Skip to content

Added a flag to cache intermediate output #59

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

prismofeverything
Copy link
Contributor

Hello,

Over here at OHSU we are preparing to launch the SMC RNA Dream Challenge, where contributors will be submitting their entries as CWL workflows. These workflows have many steps, and while developing the submission process we find ourselves having to rerun workflows repeatedly in order to work out the bugs. Most of these steps are quick, but one in particular takes a long time (~8 hours), which has been a huge limitation on our iteration cycle.

This patch adds a flag to cwltool, --cache-intermediate-output, which will cache the output of one step once completed and reuse the cached output instead of running the step over if the workflow is run again. For now this is just in a cache subdirectory in the output directory, which makes it easy to delete if you want to run it again.

Identity between runs is just based on the command line for that step. As I suspected people might want to base identity on other things (like a hash of the input files possibly) I made the identity function a parameter to main in the kwargs under generate_identity. Though for our purposes it is a reasonable assumption that if someone is using the --cache-intermediate-output flag they want to cache the output, so just using the command line as identity is sufficient, but maybe in the future someone will want to base identity on something more specific.

I also added a simple two step workflow (revsort-workflow.cwl) and a one line script to test it (cache-test.sh) to the tests directory to verify it works. All feedback welcome, thanks!

@tetron
Copy link
Member

tetron commented Apr 1, 2016

Great! I'll need some time to review.

@tetron
Copy link
Member

tetron commented Apr 4, 2016

I'd like to see the implementation tweaked a little bit:

a) --cache-intermediate-output should accept a directory supplying the cache directory

b) The cache directory should be used as tmp_outdir_prefix kwarg when creating the workflow job. This will cause the intermediate output directories to be created under the cache directory

c) Before running a job, generate the command line identity hash. Check if there is a file by that name that in the cache directory. If present, this file should contain the JSON output object from the cached execution, which you can read and return.

d) If not, run the job. If the job completes successfully and the output object is valid, create a file in the cache directory named with the command line identity hash and containing the JSON output object .

The function for generating the command line identity hash would probably work a bit better with an approach based on hashing the the contents of the "builder" object, but I'll have to give it more thought.

@prismofeverything
Copy link
Contributor Author

Hi Peter,

Thanks for taking a look! I have some questions about your requests.

a) Certainly, this is something we had talked about doing.

b) Reading the code, tmp_outdir_prefix is ignored if outdir is supplied, so won't this cause cwltool to bypass the cache if outdir is supplied? Even using the outdir for the job (rather than the outdir for the whole workflow) would fail as it is different every time the workflow is run, and for the cache to work we must be able to find it between runs.

Something else I'm missing here?

c,d) You are suggesting that our output files are cached by virtue of the cache directory being tmp_outdir_prefix, so that all we need to write beyond that is a JSON output object - which output object you are referring to? The one returned by collect_outputs is passed along independent of whether the cache is used or not, so I am assuming you mean some other output object... or I may be missing the significance of the one returned by collect_outputs beyond being a map of paths.

Either way, for the reasons above (b) I don't see how we can reliably use tmp_outdir_prefix when it is overridden by outdir. Any help appreciated there!

In the current implementation I am just copying whatever appears in the outdir for the job into the cache. Since each job has its own outdir which is distinct from the overall outdir for the workflow, whatever appears in there is guaranteed to be produced by the job in question, correct?

This is the least invasive approach I was able to come up with, in particular preserving the call to collect_outputs so that the system does not care whether the outputs are a result of the job running or simply copied from the cache. Since the result of each job is simply its outputs, this is the narrowest scope the cache could operate and the least impact on the existing system.

The cachedir can be anywhere as long as it is persistent between jobs and accessible after the workflow is complete.

Open to all perspectives otherwise of course! Thanks for taking the time to work through it.

@cwl-bot
Copy link

cwl-bot commented Apr 15, 2016

Can one of the admins verify this patch?

@mr-c
Copy link
Member

mr-c commented Apr 15, 2016

add to testlist

@prismofeverything
Copy link
Contributor Author

Okay, I have resolved all conflicts with master. Please let me know the answers to my questions above or any other concerns you have with this patch. We have already been using this branch successfully in the wild for a couple weeks now and found it critical in our development process for the SMC Dream Challenge. Would be great to have it be a real feature of cwltool, thanks!

@tetron
Copy link
Member

tetron commented Apr 15, 2016

Thanks! I won't be able to look at it until the middle of next week. I agree this is a useful feature so I may branch off your PR to tweak the implementation a bit.

@tetron
Copy link
Member

tetron commented Apr 28, 2016

Hi @prismofeverything @kellrott I used this branch as a basis for an expanded caching implementation, which is PR #71, so I'm closing this PR.

@tetron tetron closed this Apr 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants