Revamp job IDs #2986

Closed · wants to merge 6 commits

Conversation

@charleskawczynski charleskawczynski (Member) commented May 7, 2024

Attempt to close #2651.

Supersedes #2981.

This PR:

  • Merges config/default_configs/default_edmf_config.yml into config/default_configs/default_config.yml, so that there is a single default (and still overridable) config yaml.
  • Removes job_id from the yaml files.
  • Computes job_id via get_job_id, which joins the string returned by config_id_from_config_file(config_file::String) with the string returned by resource_short_name(::ClimaComms.AbstractCommsContext) (sketched below).
  • Adds config_file to AtmosConfig. config_file is now required for all jobs and has the same fallback as before; the main difference is that we can inspect (::AtmosConfig).config_file to see where the parameters came from.
  • Makes config_path and default_config_file constants, so that they can be reused in a few places.
  • Renames configs_per_job_id to configs_per_config_id; the values of the returned dict are now NamedTuples, (; config, config_file), so that, again, we can trace where parameters come from.

I've not yet removed the duplicate yaml files, but I think this should make doing so much easier.
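For illustration, here is a minimal sketch of how get_job_id could compose those two pieces; the helper bodies below are my assumptions, not necessarily this PR's actual implementations:

```julia
import ClimaComms

# Sketch: derive the config id from the yaml file name,
# e.g. "config/longruns/amip.yml" -> "amip".
config_id_from_config_file(config_file::String) =
    first(splitext(basename(config_file)))

# Sketch: a short tag for the compute resource; the real implementation
# may encode more (e.g., the number of MPI processes).
resource_short_name(ctx::ClimaComms.AbstractCommsContext) =
    ClimaComms.device(ctx) isa ClimaComms.CUDADevice ? "gpu" : "cpu"

# The job id is the config id joined with the resource name, so the same
# yaml run on CPU and on GPU yields distinct job ids (e.g., "amip_gpu").
get_job_id(config_file::String, ctx::ClimaComms.AbstractCommsContext) =
    string(config_id_from_config_file(config_file), "_", resource_short_name(ctx))
```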

@charleskawczynski (Member Author)

Deleted `job_id:` entries with:

```julia
function delete_entry(file, entry)
    all_lines = readlines(file)
    open(file, "w") do io
        for line in all_lines
            startswith(line, entry) && continue
            println(io, line)
        end
    end
end

function rm_yaml_entry(path)
    for (root, _, files) in walkdir(path)
        for f in files
            file = joinpath(root, f)
            endswith(file, ".yml") || continue
            # endswith(file, "gpu_aquaplanet_diagedmf.yml") || continue
            delete_entry(file, "job_id:")
        end
    end
end

rm_yaml_entry("config")
```

@Sbozzolo Sbozzolo (Member) left a comment

> Deleted `job_id:` with [the script above]

This is what I would have done: `find . -name "*.yml" -exec sed -i '/job_id/d' {} \;` :)

I think this PR is another workaround that does not really solve the fundamental problem. This PR essentially moves job_id to the name of the yml file and appends the device type with other information to the default output directory.

The code in this PR still fails in certain cases, e.g., I would like to check the impact of threads in our scaling tests, but this PR doesn't distinguish two jobs that are run with the same device and number of MPI processes. This PR also doesn't distinguish two jobs that are run on different architectures (e.g., comparing P100 vs V100 or skylake vs icelake). Having a job_id should be optional and we should not try to guess one for the user. It is fine to push to the user the responsibility to come up with a reasonable output_dir for their jobs. For CI, we can use the BUILDKITE_JOB_ID to ensure that there is no collision.
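For concreteness, a minimal sketch of that CI suggestion (the local fallback name is illustrative): Buildkite exports BUILDKITE_JOB_ID to every job, so it can seed a collision-free output directory.

```julia
# In CI, BUILDKITE_JOB_ID is unique per job, so output paths cannot collide;
# locally, the user supplies their own directory name.
output_dir = joinpath("output", get(ENV, "BUILDKITE_JOB_ID", "my_run"))
mkpath(output_dir)
```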

As it is, this PR also breaks plotting, because ci_plots.jl uses the job_id to determine which type of plot to produce. We could do some string manipulation to get back the job_id. Alternatively, we could maintain a dictionary that maps config files to the associated plotting routine, but that also sounds worse than the status quo.

That said, I think the fundamental distinction you draw is valid and valuable: config_id identifies a specific type of simulation, and job_id identifies a specific realization of a simulation (maybe we should call it simulation_name?).

What I envision for a target design is this: we should allow both fields and have them be user-controllable and optional. Users can use config_id to implement standard workflows for a given setup (e.g., plotting), and can use job_id/simulation_name to organize their library of runs (e.g., ClimaAnalysis could be extended to work with multiple simulations at the same time and identify them by name). Users can also decide to ignore both of them at no cost, so operations involving config_id and job_id, besides making them available, should be done outside of /src. As an example, we would set and use config_id to implement our CI plots. For CI, we can make sure that we set the output_dir in such a way that there are no collisions (e.g., using the Buildkite job id).
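As a sketch of the CI-side use of config_id envisioned here (the registry and the plotting routine are hypothetical placeholders):

```julia
# Hypothetical registry, maintained in the CI scripts (outside of /src),
# mapping a config_id to its standard plotting workflow.
const CI_PLOT_ROUTINES = Dict{String, Function}(
    "aquaplanet_diagedmf" => (output_dir -> @info("plotting EDMF diagnostics", output_dir)),
)

function make_ci_plots(config_id::String, output_dir::String)
    routine = get(CI_PLOT_ROUTINES, config_id, nothing)
    isnothing(routine) && return nothing  # no standard workflow for this config
    return routine(output_dir)
end
```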

Also, I think it would be really good if we ensured that users are able to use ClimaAtmos without any yml file. This would help make ClimaAtmos a more flexible library for atmospheric simulations.

@charleskawczynski charleskawczynski (Member Author) commented May 8, 2024

> This is what I would have done: `find . -name "*.yml" -exec sed -i '/job_id/d' {} \;` :)

Nice!

> I think this PR is another workaround that does not really solve the fundamental problem.

Yes, this PR is still a WIP, and it doesn't close the issue (allowing multiple jobs to share a yaml), but it takes one step toward a solution that makes that easier.

> This PR essentially moves job_id to the name of the yml file and appends the device type with other information to the default output directory.

Yes, this is similar to what you proposed. I ran into some issues in #2981, namely:

  • The BUILDKITE_JOB_ID, while guaranteed to be unique per job, is not really human readable, so finding your job on central requires more lookups, which is unfortunate.
  • In order to preserve the reproducibility (repro) tests, we need to find all BUILDKITE_JOB_IDs that opted into the repro tests, move those to the main branch, and know which ones we need to compare against in the future. And this is where we end up with the same problem: we need to map jobs across CI runs, and in this case we can't use the BUILDKITE_JOB_ID, because the ids will not match. We could make a dict and use a sort of fuzzy map, but we would need keys like config_id+resource_name, and at that point we would have the complexity of both solutions, which didn't seem ideal. So I decided to fall back on not using BUILDKITE_JOB_ID and to see how far we can get with config_id+resource_name.

> The code in this PR still fails in certain cases, e.g., I would like to check the impact of threads in our scaling tests, but this PR doesn't distinguish two jobs that are run with the same device and number of MPI processes.

Yes, I'm aware of that. Frankly, I don't like that our scaling pipeline runs exactly the same job twice; I think we can fix that and waste fewer resources.

> This PR also doesn't distinguish two jobs that are run on different architectures (e.g., comparing P100 vs V100 or skylake vs icelake).

Fair point, but our previous job ID didn't do that either. I'd like to solve one problem at a time; for now, let's just focus on pulling job_id out of the yaml file.

> Having a job_id should be optional and we should not try to guess one for the user.

Fair point; I considered this. I wanted to see what this solution looked like first. I may very well change things so that we also simply specify the job_id in the parsed args. It's notable that we currently have only one argument (the config file). I forget if there are calibration requirements/limitations, but I suspect it shouldn't be an issue.
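A sketch of what that could look like with ArgParse (the flag names are assumptions; ClimaAtmos's actual argument handling may differ):

```julia
using ArgParse

settings = ArgParseSettings()
@add_arg_table! settings begin
    "--config_file"
        help = "path to the yaml configuration file"
    "--job_id"
        help = "optional, human-readable id for this run"
end
parsed_args = parse_args(ARGS, settings)
```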

> It is fine to push to the user the responsibility to come up with a reasonable output_dir for their jobs. For CI, we can use the BUILDKITE_JOB_ID to ensure that there is no collision.

As I mentioned above, this ended up not working (well) with the reproducibility tests.

> As it is, this PR also breaks plotting, because ci_plots.jl uses the job_id to determine which type of plot to produce. We could do some string manipulation to get back the job_id. Alternatively, we could maintain a dictionary that maps config files to the associated plotting routine, but that also sounds worse than the status quo.

If we do a good enough job with the new job_id, I think (and hope) we can solve the plotting issues; I've not tackled that yet, as I've only just gotten CI to reach the end of the simulations in this PR.

> That said, I think the fundamental distinction you draw is valid and valuable: config_id identifies a specific type of simulation, and job_id identifies a specific realization of a simulation (maybe we should call it simulation_name?).

Agreed. I kind of like the name config_id, as it exists before a simulation is complete, but naming is perhaps a separate concern.

> What I envision for a target design is this: we should allow both fields and have them be user-controllable and optional. Users can use config_id to implement standard workflows for a given setup (e.g., plotting), and can use job_id/simulation_name to organize their library of runs (e.g., ClimaAnalysis could be extended to work with multiple simulations at the same time and identify them by name). Users can also decide to ignore both of them at no cost, so operations involving config_id and job_id, besides making them available, should be done outside of /src. As an example, we would set and use config_id to implement our CI plots. For CI, we can make sure that we set the output_dir in such a way that there are no collisions (e.g., using the Buildkite job id).

"Both fields" meaning config_id and job_id? Regarding workflows, I think we should apply the keep-it-simple principle.

> Also, I think it would be really good if we ensured that users are able to use ClimaAtmos without any yml file. This would help make ClimaAtmos a more flexible library for atmospheric simulations.

We currently do require yaml files (that's where the defaults are kept), although users could technically bypass them, since AtmosConfig is not protected by an inner constructor. That said, I don't think anyone will ever use the default constructor for a simulation, since it would be pretty laborious to configure things.

@Sbozzolo Sbozzolo (Member) commented May 8, 2024

Oh, you are right. I got confused and had something different in mind for the BUILDKITE_JOB_ID. I absolutely agree that we need something human readable. Maybe we could add an ENV variable to the pipeline to control the output dir? This would be set by a human and fixed.

> Agreed. I kind of like the name config_id, as it exists before a simulation is complete, but naming is perhaps a separate concern.

I like config_id too. It is job_id that I find a little harder to grasp.

> "Both fields" meaning config_id and job_id? Regarding workflows, I think we should apply the keep-it-simple principle.

Yes. To me, "keep it simple" means that ClimaAtmos should use neither of those internally; they are just labels. We can support them for the use cases described above, but it's up to users to do something with them. And one of those users would be our CI.

> We currently do require yaml files (that's where the defaults are kept), although users could technically bypass them, since AtmosConfig is not protected by an inner constructor. That said, I don't think anyone will ever use the default constructor for a simulation, since it would be pretty laborious to configure things.

Exactly, and this is something I would like to fix. Setting up an AtmosConfig is already a mess that tangles together multiple files from multiple repos, so I would like us to reduce the indirection required to build an AtmosConfig, not add more steps. It's impossible to know which parameters/configurations are needed and which are not: all the default parameters are passed all the time, and it is impossible to know which are "active" and which are not. These are recipes for unexpected behavior as soon as someone goes off the beaten path, and they make learning how Atmos works harder than it has to be. Fixing this is a much harder problem, but, to begin with, we should keep in mind that a configuration is fundamentally a list of key-value pairs, so the fact that the list comes from a file (or multiple files) should be a technical detail, not a requirement.
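A minimal sketch of that direction, assuming AtmosConfig could be built from plain key-value pairs (whether a Dict-accepting constructor already exists is an assumption here, and the keys are illustrative):

```julia
import ClimaAtmos

# A configuration is fundamentally a list of key-value pairs;
# a yaml file is just one possible source for them.
config_dict = Dict("dt" => "400secs", "moist" => "equil", "t_end" => "10days")
config = ClimaAtmos.AtmosConfig(config_dict)  # assumed constructor, see lead-in
```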

> Let's just focus on pulling job_id out of the yaml file.

I suggest:

  • we rename job_id to config_id,
  • we make sure that no id is required to run ClimaAtmos (we will still use ids in our pipelines, but one should be able to use Atmos without them),
  • and we use ENV variables to control collisions in our CI.

@charleskawczynski charleskawczynski deleted the ck/revamp_job_id branch August 15, 2024 13:05
Linked issue: CPU + GPU runs can't share yml files