
Conversation

@SGallagherMet commented Sep 18, 2025

Use custom cylc triggers to skip 'baking' when 'parbake' didn't find any recipes to configure.
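For readers unfamiliar with the pattern, custom task outputs in Cylc 8 look roughly like this. This is a minimal sketch using the task and output names discussed later in the thread, not the actual flow.cylc from this PR; `recipes_found` is a hypothetical placeholder for the real check:

```ini
[scheduling]
    [[graph]]
        P1D = """
            # Only run baking when parbaking found recipes to configure.
            parbake_recipes:start_baking? => bake_recipes?
            parbake_recipes:skip_baking? | bake_recipes? => cycle_complete
        """
[runtime]
    [[parbake_recipes]]
        script = """
            if recipes_found; then
                cylc message -- "start baking"
            else
                cylc message -- "skip baking"
            fi
        """
        [[[outputs]]]
            start_baking = "start baking"
            skip_baking = "skip baking"
```

The `?` markers make the outputs optional, so the scheduler does not treat an unsatisfied branch as a failure.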

Contribution checklist

Aim to have all relevant checks ticked off before merging. See the developer's guide for more detail.

  • Documentation has been updated to reflect the change.
  • New code has tests, and affected old tests have been updated.
  • All tests and CI checks pass.
  • Ensured the pull request title is descriptive.
  • Conda lock files have been updated if dependencies have changed.
  • Attributed any Generative AI, such as GitHub Copilot, used in this PR.
  • Marked the PR as ready to review.

@SGallagherMet SGallagherMet self-assigned this Sep 18, 2025
@SGallagherMet (Author)

@jfrost-mo - I've not asked for a review yet because I haven't tested it properly, but can you see any problems with this?

@jfrost-mo (Member)

Seems like a sensible approach. Compared to just making the bake task succeed when it has no recipes, this avoids requesting a large job only to not use it.

@jfrost-mo jfrost-mo added this to the CSET v25.11.0 milestone Oct 28, 2025
@jfrost-mo (Member) left a comment


Some minor suggestions, but otherwise looks good. After some testing I'd be happy for this to be merged.

@jfrost-mo jfrost-mo marked this pull request as ready for review November 11, 2025 09:05
@jfrost-mo (Member)

I'll give this a test and merge it if it looks good, ahead of our next release.

@jfrost-mo (Member)

Giving this a test today: the bake_recipes task still remains in the graph and doesn't block the flow, but it does block the workflow from finishing.

[Screenshot from 2025-11-25 14-28-53]

@jfrost-mo (Member) commented Nov 25, 2025

I'm a bit tempted to go for an alternative implementation where we unconditionally run the baking tasks. It would be less efficient, but it should be simpler in the workflow.

It depends on the cost of getting a job that doesn't do anything through the queue.

@jfrost-mo (Member)

For long trials I can see the null jobs spending a fair amount of time queuing, as you will have one per cycle.

@jfrost-mo (Member) commented Nov 25, 2025

I can get it working with suicide triggers, but I'd like to avoid them if possible, as they are more of a Cylc 7 thing.
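For context, the suicide-trigger variant being referred to would look roughly like this in Cylc graph notation (a sketch with the task and output names assumed from the rest of the thread; `!task` removes the task from the graph at that point):

```ini
[scheduling]
    [[graph]]
        P1D = """
            parbake_recipes:start_baking? => bake_recipes
            # If parbaking found nothing, remove bake_recipes from this cycle.
            parbake_recipes:skip_baking? => !bake_recipes
        """
```

Suicide triggers work, but Cylc 8's optional outputs are generally preferred because the graph then states the intent directly rather than retroactively removing tasks.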

@jfrost-mo (Member) commented Nov 25, 2025

The issue is that bake_recipes is not an optional task, so although the flow continues onwards past it, it still remains in the graph, causing an error.

This is mainly to ensure we have validated our environment before using it.
@jfrost-mo (Member)

I've done some testing around this, and the current behaviour (since we introduced parbaking) is actually that the bake/bake_aggregation tasks will succeed even if they have no recipes. Instead, the parbaking task has its own check that at least one recipe is enabled, even if it is not the aggregation type being produced.

So the use case of disabling all non-aggregation tasks is already supported, though it does run some baking jobs that immediately finish.

The open question is how big a problem that is. I'm tempted to close this pull request on the basis that I'm not sure the added complexity in the workflow is worth the savings. In my testing today, the Slurm job for the baking only queued for about 10 seconds and finished within 2 seconds when there were no recipes.

However, I haven't been using much compute over the past few days, so my fair-share is in good standing.

@jfrost-mo jfrost-mo removed this from the CSET v25.11.0 milestone Nov 25, 2025
@SGallagherMet (Author)

So, from https://cylc.github.io/cylc-doc/stable/html/user-guide/writing-workflows/scheduling.html#dependencies-with-multiple-optional-outputs:
"If one task succeeds and the other fails, then the task ... will be left with one satisfied and one unsatisfied dependency. This will cause the workflow to stall."
With my original graphing, bake_recipes is left with fetch_complete:succeeded satisfied and parbake_recipes:start_baking unsatisfied.
If you wanted to avoid a suicide trigger, you could move parbake_recipes to after fetch_complete:

fetch_complete => parbake_recipes:start_baking? => bake_recipes?
parbake_recipes:skip_baking? | bake_recipes? => cycle_complete

@SGallagherMet (Author) commented Nov 25, 2025

With regards to whether this is worth it, though: the problem is with a long trial (say 3*30*24 = 2160 cycles), where you end up making a lot of Slurm requests that do nothing. If you have a large domain where the resource requests get big, that could start causing problems with fair-share. (I don't have direct evidence of it happening in CSET, but we do see it happen in the RES sometimes.)

Thinking about it again, I'm not convinced this is the correct solution. You don't really want the workflow to run at all if it's not going to do anything (or you just want it to run the fetch and aggregation). Some version of #1720 that checks the recipes before the suite is even installed is probably a better solution.

@jfrost-mo (Member)

It probably is worth it then, but yes, there may be better ways.

Given we do want the data to be fetched, the remaining tasks running per-cycle would be:

  • fetch_data - cheap (1 CPU, 2GiB RAM), fast when fetching from the filesystem, slow when fetching from MASS.
  • parbake_recipes - cheap (1 CPU, 2GiB RAM), fast (sub 10 second), and with no inter-cycle dependency.
  • bake_recipe - expensive (32 CPU, 112 GiB RAM), fast (sub 10 seconds) with no recipes.
  • A couple of skip-mode tasks used for scheduling. These wouldn't be a problem even in their thousands.

So the tasks we want to eliminate are the parbaking and especially the baking. In principle there is no reason the parbaking has to happen in the cycle it is for, as it has no dependency on other tasks (other than the initial environment setup). So we could move to parbaking for all cycles in the first cycle, which would save us a slurm job per cycle.

The same thing could be said about the baking task, TBH. Once we have parbaked the recipes, the only cycle-specific thing it does is look in the appropriately dated folder for the recipes to bake, and that is only so we don't end up baking the same recipe multiple times. We could ditch cycling entirely and just chunk the work up for parallelism, regardless of which cycle it was for. That might make it harder to support running in real time, though I'm not convinced there is much of a driving motivation to do that with CSET.
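A non-cycling version of that idea could use Cylc task parameters for the parallel chunks. This is a sketch with invented task names (`collate_results`, the `chunk` parameter), not the actual CSET workflow:

```ini
[task parameters]
    chunk = 1..8
[scheduling]
    [[graph]]
        R1 = """
            fetch_data => parbake_recipes
            # Each bake<chunk> task takes a share of all parbaked recipes,
            # regardless of which cycle point the data was for.
            parbake_recipes => bake<chunk> => collate_results
        """
```

The `R1` recurrence runs everything once, and `bake<chunk>` expands into eight parallel tasks, so parallelism comes from parameterisation rather than from cycling.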
