Revamp job IDs #2986
Conversation
Force-pushed from db07d19 to 400833f.
Force-pushed from 617680d to 8beb90a.
Force-pushed from 8beb90a to a305803.
Deleted job_id: with

```julia
function delete_entry(file, entry)
    all_lines = readlines(file)
    open(file, "w") do io
        for line in all_lines
            startswith(line, entry) && continue
            println(io, line)
        end
    end
end

function rm_yaml_entry(path)
    for (root, _, files) in walkdir(path)
        for f in files
            file = joinpath(root, f)
            endswith(file, ".yml") || continue
            # endswith(file, "gpu_aquaplanet_diagedmf.yml") || continue
            delete_entry(file, "job_id:")
        end
    end
end

rm_yaml_entry("config")
```
This is what I would have done: find . -name "*.yml" -exec sed -i '/job_id/d' {} \;
:)
I think this PR is another workaround that does not really solve the fundamental problem. This PR essentially moves job_id to the name of the yml file and appends the device type, along with other information, to the default output directory.

The code in this PR still fails in certain cases. For example, I would like to check the impact of threads in our scaling tests, but this PR doesn't distinguish two jobs that are run with the same device and number of MPI processes. It also doesn't distinguish two jobs that are run on different architectures (e.g., comparing P100 vs V100, or Skylake vs Icelake). Having a job_id should be optional, and we should not try to guess one for the user. It is fine to push onto the user the responsibility of coming up with a reasonable output_dir for their jobs. For CI, we can use BUILDKITE_JOB_ID to ensure that there are no collisions.
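To illustrate that last point, here is a minimal sketch, assuming nothing about ClimaAtmos internals (the helper name and the fallback directory are made up), of how CI could get a collision-free output directory from Buildkite while users remain free to pick their own:

```julia
# Minimal sketch (not ClimaAtmos code): pick an output directory that cannot
# collide on CI. Buildkite sets BUILDKITE_JOB_ID for every job; locally the
# variable is absent, so the user-supplied directory (or a plain default) wins.
function default_output_dir(user_output_dir = nothing)
    isnothing(user_output_dir) || return user_output_dir
    buildkite_id = get(ENV, "BUILDKITE_JOB_ID", nothing)
    return isnothing(buildkite_id) ? "output" : joinpath("output", buildkite_id)
end
```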
As it is, this PR also breaks plotting, because ci_plots.jl uses the job_id to determine which type of plot to produce. We could do some string manipulation to get back the job_id. Alternatively, we could maintain a dictionary that maps config files to the associated plotting routine, but that also sounds worse than the status quo.
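For reference, the dictionary alternative mentioned above could look roughly like the sketch below; the file names and plotting functions are placeholders, not what ci_plots.jl actually contains:

```julia
# Hypothetical mapping from config file to a plotting routine for CI.
# File names and plotting functions are illustrative placeholders only.
held_suarez_plots(output_dir) = @info "Would produce Held-Suarez plots from $output_dir"
aquaplanet_edmf_plots(output_dir) = @info "Would produce aquaplanet EDMF plots from $output_dir"

const CI_PLOTS = Dict(
    "config/model_configs/sphere_held_suarez.yml" => held_suarez_plots,
    "config/model_configs/gpu_aquaplanet_diagedmf.yml" => aquaplanet_edmf_plots,
)

# Look up the routine registered for a config file (if any) and call it.
function make_ci_plots(config_file, output_dir)
    plot_fn = get(CI_PLOTS, config_file, nothing)
    isnothing(plot_fn) && return @warn "No CI plots registered for $config_file"
    return plot_fn(output_dir)
end
```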
That said, I think that the fundamental distinction that you draw is valid and valuable: config_id identifies a specific type of simulation, and job_id identifies a specific realization of a simulation (maybe we should call it simulation_name?).

What I envision for a target design is this: we should allow both fields and have them be user-controllable and optional. Users can use config_id to implement standard workflows for a given setup (e.g., plotting), and can use job_id/simulation_name to organize their library of runs (e.g., ClimaAnalysis could be extended to work with multiple simulations at the same time and identify them by name). Users can also decide to ignore both of them at no cost, so operations involving config_id and job_id, besides making them available, should be done outside of /src. As an example, we would set and use config_id to implement our CI plots. For CI, we can make sure that we set the output_dir in such a way that there are no collisions (e.g., using the Buildkite job id).
Also, I think it would be really good if we ensured that users are able to use ClimaAtmos without any yml file. This would help make ClimaAtmos a more flexible library for atmospheric simulations.
Nice!
Yes, this PR is still a WIP, and it doesn't close the issue (allowing multiple jobs to share a yaml), but it takes one step toward a solution that makes that easier.
Yes, this is similar to what you proposed. I ran into some issues in #2981, namely:
Yes, I'm aware of that. Frankly, I don't like that our scaling pipeline runs exactly the same job twice; I think we can fix that and waste fewer resources.
Fair point, but our previous job ID didn't do that either. I'd like to solve one problem at a time; for now, let's just focus on pulling
Fair point, I considered this. I wanted to see what this solution looked like, first. I very well may change things so that we simply also specify the
As I mentioned above, this ended up not working (well) with the reproducibility tests.
If we do a good enough job with the new
Agreed. I kind of like the name
both fields meaning
We currently do require yaml files (that's where defaults are kept). Although users could technically bypass them since
Force-pushed from a305803 to 8b679cd.
Oh, you are right. I got confused and I had something different in mind for the
I like
Yes, to me, "keep it simple" means that ClimaAtmos should use neither of those internally. They are just labels. We can support them for the use cases described above, but that's up to users to do something with them. And one of the users would be our CI.
Exactly, and this is something I would like to fix. Setting up an AtmosConfig is already a mess that tangles together multiple files from multiple repos, so I would like us to reduce the indirection required to build an AtmosConfig, not add more steps. It's impossible to know which parameters/configurations are needed and which are not. All the default parameters are passed all the time, and it is impossible to know which are "active" and which are not. These are recipes for unexpected behaviors as soon as someone goes off the beaten path, and they make learning how Atmos works harder than it has to be. Fixing this is a much harder problem, but, to begin with, we should keep in mind that a configuration is fundamentally a list of key-value pairs, so the fact that the list comes from a file (or multiple files) should be a technical detail, not a requirement.
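A sketch of that last point, assuming nothing about how AtmosConfig is currently built (the keys and helper below are made up): the configuration is just a merge of key-value pairs, and reading a yml file is only one optional way to produce them:

```julia
import YAML  # only needed if the overrides actually come from a file

# Defaults live as data rather than as a file that must always be read.
default_config() = Dict("dt" => "600secs", "moist" => "dry")

# Overrides can come from anywhere: a Dict built in a script...
overrides_from_code = Dict("moist" => "equil")
# ...or, as a technical detail, from a yml file on disk.
overrides_from_file(path) = YAML.load_file(path)

# Either way, the final configuration is a plain merge of key-value pairs.
config = merge(default_config(), overrides_from_code)
```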
I suggest:
Attempt to close #2651.
Supersedes #2981.
This PR:
- Merges config/default_configs/default_edmf_config.yml into config/default_configs/default_config.yml, so that there is a single default (and still over-ridable) config yaml.
- Removes job_id from the toml files.
- Constructs the job_id using get_job_id, which joins a string returned from config_id_from_config_file(config_file::String) and a string returned from resource_short_name(::ClimaComms.AbstractCommsContext) (see the sketch after this list).
- Adds config_file to AtmosConfig. config_file is now required for all jobs, and has the same fallback as before. The main difference is that we can inspect (::AtmosConfig).config_file to see where the parameters came from.
- Adds config_path and default_config_file constants, so that they can be reused in a few places.
- Renames configs_per_job_id to configs_per_config_id; the values of the returned dict are now a NamedTuple, (; config, config_file), so that, again, we can trace where parameters are coming from.

I've not yet removed the duplicate yaml files, but I think this should make doing so much easier.
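As a reading aid for the job-id item above, the composition might look roughly like the following; only the function names come from the PR description, and the bodies are guesses rather than the actual implementation:

```julia
import ClimaComms

# Sketch only: the bodies below are illustrative guesses, not the merged code.
config_id_from_config_file(config_file::String) =
    first(splitext(basename(config_file)))  # e.g. "config/foo_bar.yml" -> "foo_bar"

resource_short_name(context::ClimaComms.AbstractCommsContext) =
    lowercase(string(nameof(typeof(ClimaComms.device(context)))))  # e.g. "cudadevice"

get_job_id(config_file, context) =
    string(config_id_from_config_file(config_file), "_", resource_short_name(context))
```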