CPU + GPU runs can't share yml files #2651
Comments
The default output folder is the job name, so your simulations are writing to the same output folder. The issue is not with CPU/GPU specifically, but with "we cannot run two runs with the same yml file in the same buildkite". I am not sure what the cleanest way to allow this is. Prepending the device type doesn't really fix the problem; it only allows you to run one CPU and one GPU job in the same build. Maybe this is enough, but it would be nice to have a more robust solution.
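A minimal sketch of the collision, assuming the default behavior described above (the job name and paths are illustrative, not taken from the actual pipeline):

```julia
# Both runs parse the same yml file, so they derive the same job name and,
# with the default, the same output folder (names here are made up).
job_id = "sphere_baroclinic_wave"             # derived from the shared yml config

output_dir_cpu = joinpath("output", job_id)   # "output/sphere_baroclinic_wave"
output_dir_gpu = joinpath("output", job_id)   # identical path: the two runs clobber each other
```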
Yeah, they can't share the same yaml file if the job ID is in there. Perhaps the job ID shouldn't be in the yaml file? It's kind of a derived string name from the configuration. It'd still be good to export the
I think that we could start using the autogenerated job IDs. My only concerns are guaranteeing uniqueness and some clean way to deal with continuity if we switch over.
One other difficulty is how to specify the artifacts folder (which basically should be the job id).
One potential option, which might suit us well in ClimaAtmos, is using a dynamic pipeline (at least for experiments).
This seems to be a big issue. What I'd like to propose is:
@tapios, is it alright if I fix this so that I can ensure performance updates are consistent and accurate?
I'm not sure I fully understand the proposed solution here.
I do not understand what you mean by removing configs for perf jobs. Would you not use the standard yaml files for perf jobs? How would you specify the config through the job id?
I also don't understand the proposed solution, but I fully agree that what we are doing is really messy and comes with some unreasonable workflows.

For the particular problem at hand (the need for two identical configuration files because we run one on CPU and the other on GPU, and the output_dir name is determined by job_id), the simplest workaround is to append the device type to the output directory name. This is accomplished with something like the following change:

```diff
-default_output = haskey(ENV, "CI") ? job_id : joinpath("output", job_id)
+default_output = haskey(ENV, "CI") ? job_id : joinpath("output", string(job_id, "_", nameof(typeof(ClimaComms.device()))))
 out_dir = parsed_args["output_dir"]
 base_output_dir = isnothing(out_dir) ? default_output : out_dir
```

where we append the device type to the output folder.

Taking a higher-level view of what we are doing here, I think this is not a solution, but a workaround and a symptom of larger usability issues. Other usability problems are (non-exclusive and non-ordered list):
Some comments on this:
So, my suggestion is to fix the immediate issue here. I think appending the device type to a job (or config) ID just kicks the can down the road. What we used to do was have a run script that specified an output directory. If you want to generate something automatically, how about appending the system date and time to some ID that identifies the configuration and maybe the device?
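A minimal sketch of that suggestion, assuming a `config_id` string that identifies the configuration (the name and the timestamp format are illustrative):

```julia
import Dates
import ClimaComms

# Append the device type and the system date/time to a configuration ID,
# so repeated runs of the same configuration get distinct output folders.
config_id = "sphere_baroclinic_wave"                      # illustrative configuration ID
device_tag = string(nameof(typeof(ClimaComms.device())))  # e.g. "CPUSingleThreaded"
timestamp = Dates.format(Dates.now(), "yyyy-mm-dd_HH-MM-SS")
output_dir = joinpath("output", string(config_id, "_", device_tag, "_", timestamp))
```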
Sorry, I wasn't clear. The perf jobs would still use (existing) yaml files. The difference would be in how they're specified. Rather than having their own separate yaml configs that we have to keep in sync with the ones whose performance we're trying to understand, we can instead specify a "target config". So, a perf job might look like the sketch below.
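A hedged sketch of the idea; the entry-script name, the environment variable, the default path, and the exact `AtmosConfig` / `get_simulation` calls are assumptions rather than the actual interface:

```julia
# perf/flame_target.jl (hypothetical): reuse an existing "target" yaml config
# instead of keeping a separate, duplicated perf config in sync with it.
import ClimaAtmos as CA

# The target config points at the yaml whose performance we want to understand
# (variable name and default path are illustrative).
target_config = get(ENV, "PERF_TARGET_CONFIG",
                    joinpath("config", "model_configs", "baroclinic_wave.yml"))

config = CA.AtmosConfig(target_config)   # assumed: AtmosConfig accepts a yaml path
simulation = CA.get_simulation(config)   # the function discussed in this issue
@time CA.solve_atmos!(simulation)        # assumed entry point for the timed run
```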
One issue with this is that we'd then need variations for all compute resources (context, threads, cpu/gpu). Instead, I suggest we check whether we're on Buildkite:
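A minimal sketch of that check, reusing the job_id-based default from the diff above: when running on Buildkite, every job already has a unique ID, so it can serve as the output directory.

```julia
# On Buildkite, use the unique job ID as the output directory so two runs
# sharing a yml file cannot collide; locally, keep the job_id-based default.
default_output = if haskey(ENV, "BUILDKITE_JOB_ID")
    ENV["BUILDKITE_JOB_ID"]
else
    joinpath("output", job_id)
end
```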
I agree with @tapios on this; for the sake of making incremental improvement on this issue, I'd suggest we keep the scope narrow and open separate issues for these other items. I think using
This sounds fine to me as a workaround for now. I'd be grateful if you implement it. We should then have a broader discussion (again) about how we specify model configuration parameters (resolution etc.), adjustable model parameters (ClimaParams), and details specific to a model run (which may include initial conditions and an output dir). We should be able to compose simulations by varying all of these in the end.
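A hedged illustration of that composition idea; the file names, keys, and merge order are assumptions, not an existing interface:

```julia
import YAML

# Three separate layers, composed into one run specification:
model_config = YAML.load_file("model/baroclinic_wave.yml")  # resolution, physics choices, ...
clima_params = YAML.load_file("parameters/overrides.yml")   # adjustable model parameters (ClimaParams)
run_details = Dict(
    "initial_condition" => "restart_day200.hdf5",           # run-specific details
    "output_dir" => "output/baroclinic_wave_gpu",
)

# Later layers take precedence over earlier ones.
config = merge(model_config, clima_params, run_details)
```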
I like the suggestion for perf (but do we need the

And what you suggest for the collision can be implemented in just one line:

```julia
haskey(ENV, "BUILDKITE_JOB_ID") && (config.parsed_args["output_dir"] = ENV["BUILDKITE_JOB_ID"])
```

I also think the problem is a problem for CI only (where we use the default output_dir, which is set to job_id). Real users and calibration will submit jobs with a specified output_dir (possibly procedurally generated with scripts).
Yeah, we may be able to get rid of it for the perf jobs. I'll see if there are any edge cases we didn't think of.
Exactly. We can add that if it's needed, but we may not even need it since a lot of the perf jobs have their own entry scripts (e.g.,
Yes, agreed.
It is needed: collisions also occur outside of perf jobs; for example, we have
It would be nice to use the same yml config files for CPU and GPU with the same setup, and we can do this for individual runs. However, this fails (in the coupler) if both runs are included in the same buildkite job, because ClimaAtmos's `get_simulation` tries to access the same NetCDF files for both runs (e.g. see here). Maybe this is because the filename is based on `run_name`? If so, we could just prepend the device type to get unique filenames for CPU vs GPU.

cc @Sbozzolo
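A minimal sketch of the prepend-the-device-type idea for the NetCDF filenames, assuming they are currently derived from `run_name` alone (the run name and naming scheme below are illustrative):

```julia
import ClimaComms

# Tag output files with the device type so CPU and GPU runs of the same
# configuration no longer fight over the same NetCDF files.
run_name = "amip_coupled"                                  # illustrative run name
device_tag = string(nameof(typeof(ClimaComms.device())))   # e.g. "CPUSingleThreaded" or "CUDADevice"
nc_filename = string(device_tag, "_", run_name, "_diagnostics.nc")
```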