Added a flag to cache intermediate output #59

prismofeverything · 2016-04-01T17:17:24Z

Hello,

Over here at OHSU we are preparing to launch the SMC RNA Dream Challenge, where contributors will be submitting their entries as CWL workflows. These workflows have many steps, and while developing the submission process we find ourselves having to rerun workflows repeatedly in order to work out the bugs. Most of these steps are quick, but one in particular takes a long time (~8 hours), which has been a huge limitation on our iteration cycle.

This patch adds a flag to cwltool, --cache-intermediate-output, which will cache the output of one step once completed and reuse the cached output instead of running the step over if the workflow is run again. For now this is just in a cache subdirectory in the output directory, which makes it easy to delete if you want to run it again.

Identity between runs is just based on the command line for that step. As I suspected people might want to base identity on other things (like a hash of the input files possibly) I made the identity function a parameter to main in the kwargs under generate_identity. Though for our purposes it is a reasonable assumption that if someone is using the --cache-intermediate-output flag they want to cache the output, so just using the command line as identity is sufficient, but maybe in the future someone will want to base identity on something more specific.

I also added a simple two step workflow (revsort-workflow.cwl) and a one line script to test it (cache-test.sh) to the tests directory to verify it works. All feedback welcome, thanks!

… if the arguments have not changed

tetron · 2016-04-01T21:25:11Z

Great! I'll need some time to review.

…r SMC

tetron · 2016-04-04T03:20:19Z

I'd like to see the implementation tweaked a little bit:

a) --cache-intermediate-output should accept a directory supplying the cache directory

b) The cache directory should be used as tmp_outdir_prefix kwarg when creating the workflow job. This will cause the intermediate output directories to be created under the cache directory

c) Before running a job, generate the command line identity hash. Check if there is a file by that name that in the cache directory. If present, this file should contain the JSON output object from the cached execution, which you can read and return.

d) If not, run the job. If the job completes successfully and the output object is valid, create a file in the cache directory named with the command line identity hash and containing the JSON output object .

The function for generating the command line identity hash would probably work a bit better with an approach based on hashing the the contents of the "builder" object, but I'll have to give it more thought.

prismofeverything · 2016-04-04T19:16:48Z

Hi Peter,

Thanks for taking a look! I have some questions about your requests.

a) Certainly, this is something we had talked about doing.

b) Reading the code, tmp_outdir_prefix is ignored if outdir is supplied, so won't this cause cwltool to bypass the cache if outdir is supplied? Even using the outdir for the job (rather than the outdir for the whole workflow) would fail as it is different every time the workflow is run, and for the cache to work we must be able to find it between runs.

Something else I'm missing here?

c,d) You are suggesting that our output files are cached by virtue of the cache directory being tmp_outdir_prefix, so that all we need to write beyond that is a JSON output object - which output object you are referring to? The one returned by collect_outputs is passed along independent of whether the cache is used or not, so I am assuming you mean some other output object... or I may be missing the significance of the one returned by collect_outputs beyond being a map of paths.

Either way, for the reasons above (b) I don't see how we can reliably use tmp_outdir_prefix when it is overridden by outdir. Any help appreciated there!

In the current implementation I am just copying whatever appears in the outdir for the job into the cache. Since each job has its own outdir which is distinct from the overall outdir for the workflow, whatever appears in there is guaranteed to be produced by the job in question, correct?

This is the least invasive approach I was able to come up with, in particular preserving the call to collect_outputs so that the system does not care whether the outputs are a result of the job running or simply copied from the cache. Since the result of each job is simply its outputs, this is the narrowest scope the cache could operate and the least impact on the existing system.

The cachedir can be anywhere as long as it is persistent between jobs and accessible after the workflow is complete.

Open to all perspectives otherwise of course! Thanks for taking the time to work through it.

cwl-bot · 2016-04-15T17:52:41Z

Can one of the admins verify this patch?

mr-c · 2016-04-15T17:57:38Z

add to testlist

prismofeverything · 2016-04-15T19:01:37Z

Okay, I have resolved all conflicts with master. Please let me know the answers to my questions above or any other concerns you have with this patch. We have already been using this branch successfully in the wild for a couple weeks now and found it critical in our development process for the SMC Dream Challenge. Would be great to have it be a real feature of cwltool, thanks!

tetron · 2016-04-15T20:57:58Z

Thanks! I won't be able to look at it until the middle of next week. I agree this is a useful feature so I may branch off your PR to tweak the implementation a bit.

…into cache-intermediate-output

tetron · 2016-04-28T14:04:46Z

Hi @prismofeverything @kellrott I used this branch as a basis for an expanded caching implementation, which is PR #71, so I'm closing this PR.

added a flag to cache intermediate output between steps of a workflow…

e282594

… if the arguments have not changed

some updates to the identity generation and caching based on tests fo…

2b7a461

…r SMC

checking to see if pathmapper has a dirs property

5664b54

prismofeverything added 2 commits April 15, 2016 11:37

resolved conflicts from recent commits

87fd91d

added import builder line back in

5b85d87

prismofeverything added 2 commits April 26, 2016 10:36

Merge branch 'master' of github.com:common-workflow-language/cwltool …

9373624

…into cache-intermediate-output

Merge branch 'master' of github.com:common-workflow-language/cwltool …

8d3b2fc

…into cache-intermediate-output

tetron closed this Apr 28, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Added a flag to cache intermediate output #59

Added a flag to cache intermediate output #59

Uh oh!

prismofeverything commented Apr 1, 2016

Uh oh!

tetron commented Apr 1, 2016

Uh oh!

tetron commented Apr 4, 2016

Uh oh!

prismofeverything commented Apr 4, 2016

Uh oh!

cwl-bot commented Apr 15, 2016

Uh oh!

mr-c commented Apr 15, 2016

Uh oh!

prismofeverything commented Apr 15, 2016

Uh oh!

tetron commented Apr 15, 2016

Uh oh!

tetron commented Apr 28, 2016

Uh oh!

Uh oh!

Uh oh!

Added a flag to cache intermediate output #59

Added a flag to cache intermediate output #59

Uh oh!

Conversation

prismofeverything commented Apr 1, 2016

Uh oh!

tetron commented Apr 1, 2016

Uh oh!

tetron commented Apr 4, 2016

Uh oh!

prismofeverything commented Apr 4, 2016

Uh oh!

cwl-bot commented Apr 15, 2016

Uh oh!

mr-c commented Apr 15, 2016

Uh oh!

prismofeverything commented Apr 15, 2016

Uh oh!

tetron commented Apr 15, 2016

Uh oh!

tetron commented Apr 28, 2016

Uh oh!

Uh oh!