
CPU + GPU runs can't share yml files #2651

Closed · juliasloan25 opened this issue Feb 8, 2024 · 14 comments · Fixed by #2994

Labels: bug (Something isn't working)

Comments

@juliasloan25 (Member)

It would be nice to use the same yml config file for CPU and GPU runs with the same setup, and this works for individual runs. However, it fails (in the coupler) when both runs are included in the same buildkite job, because ClimaAtmos's get_simulation tries to access the same NetCDF files for both runs (e.g., see here). Maybe this is because the filename is based on run_name? If so, we could prepend the device type to get unique filenames for CPU vs GPU.

cc @Sbozzolo

juliasloan25 added the bug label Feb 8, 2024
@Sbozzolo (Member) commented Feb 8, 2024

The default output folder is the job name, so your simulations are writing to the same output folder. The issue is not with CPU/GPU specifically, but with "we cannot run two runs with the same yml file in the same buildkite build". I am not sure what the cleanest way to allow this is. Prepending the device type doesn't really fix the problem; it only allows you to run one CPU and one GPU job in the same build. Maybe this is enough, but it would be nice to have a more robust solution.

@charleskawczynski (Member)

Yeah, they can't share the same yaml file if the job ID is in there. Perhaps the job ID shouldn't be in the yaml file? It's essentially a string name derived from the configuration. It'd still be good to export the job_id to the output folder (xref: #2659), but excluding it from the yaml would fix this issue, since the job IDs could then differ while the yaml stays the same. The job ID is basically always prescribed anyway.

@nefrathenrici (Member)

I think we could start using autogenerated job IDs. My only concerns are guaranteeing uniqueness and having a clean way to deal with continuity if we switch over.

@charleskawczynski (Member)

Another difficulty is how to specify the artifacts folder (which should basically be the job ID).

@charleskawczynski (Member)

One option that might suit us well in ClimaAtmos is using a dynamic pipeline (at least for experiments).

@charleskawczynski (Member)

This seems to be a big issue. For example, gpu_prognostic_edmfx_aquaplanet and prognostic_edmfx_aquaplanet appear to have many differences aside from CPU vs GPU. The same issue shows up with the performance analysis configs falling out of sync with the actual model runs. I think we need to prioritize this.

What I'd like to propose is:

  • Remove job_id from the config
  • Remove configs for perf jobs. Instead, make all performance jobs take only the job_id, plus overrides. This will force them not to fall out of sync.
  • For resolving the netcdf/ race condition issues, we could check if we're on buildkite and, if so, prefix by the buildkite job ID. This will avoid collisions between jobs running in parallel.

@tapios, is it alright if I fix this so that I can ensure performance updates are consistent and accurate?

@tapios (Contributor) commented May 4, 2024

I'm not sure I fully understand the proposed solution here.

> • Remove configs for perf jobs. Instead, make all performance jobs take only the job_id, plus overrides. This will force them not to fall out of sync.

I do not understand what you mean by removing configs for perf jobs. Would you not use the standard yaml files for perf jobs? How would you specify the config through the job id?

@Sbozzolo?

@Sbozzolo (Member) commented May 4, 2024

I also don't understand the proposed solution, but I fully agree that what we are doing is really messy and comes with some unreasonable workflows.

For the particular problem at hand (needing two identical configuration files because we run one on CPU and the other on GPU, and the output_dir name is determined by job_id), the simplest workaround is to append the device type to the output_dir and get rid of one group of configurations.

This is accomplished with something like (in type_getter.jl)

-default_output = haskey(ENV, "CI") ? job_id : joinpath("output", job_id)
+device_suffix = string(nameof(typeof(ClimaComms.device())))
+default_output = haskey(ENV, "CI") ? job_id * "_" * device_suffix : joinpath("output", job_id * "_" * device_suffix)
out_dir = parsed_args["output_dir"]
base_output_dir = isnothing(out_dir) ? default_output : out_dir

where we append the device type to the output folder.

Taking a higher-level view of what we are doing here, I think this is not a solution, but a workaround and a symptom of larger usability issues. Other usability problems are (non-exclusive and non-ordered list):

  • Running sweeps of parameters requires creating one file per value, even when only one option changes
  • Due to #2167 ("Reading parameters from toml in the yaml file assumes that the toml is in the working directory"), there's only one place in one's filesystem where ClimaAtmos can be run
  • In my opinion, the details of how we run buildkite should not enter the source code of ClimaAtmos
  • Some parameters are set with files, others with environment variables (this snippet also shows that we read an environment variable CI and do something with it, but I cannot tell whether it is currently being used or not)
  • Typos in the yml files are silently ignored (e.g., if I set jobid instead of job_id, the code will happily run but use the default job_id); a key-validation sketch follows this list
  • Options that are known to be incompatible are not checked (e.g., you can set 1D topography configurations on 2D runs and the code will crash with no informative error)
  • Options are inconsistent: some want the bool true, others the string "true", others have specific names, or mix things (e.g., help: "Hyperdiffusion [ClimaHyperdiffusion (or true; default), none (or false)]")
  • For most options, what they do is impossible to discern without reading the code
  • There can be (and probably is) a disconnect between what is defined in the defaults_config.yml and what the code actually does/wants
  • Following the chain of what happens when setting up a simulation, and understanding which input files are required, is very complex
  • Setting up a new simulation that is not exactly what is already available (e.g., changing the initial conditions) is difficult and requires a great deal of detailed understanding of how the code works.
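
As a hedged illustration of the silent-typo point above (a hypothetical check, not current ClimaAtmos behavior), unknown keys could be rejected by comparing a parsed config against the defaults file:

import YAML

function validate_config_keys(config_path, defaults_path)
    config = YAML.load_file(config_path)
    defaults = YAML.load_file(defaults_path)
    unknown = setdiff(keys(config), keys(defaults))
    # Fail loudly on typos like `jobid`, instead of silently falling back to the default `job_id`.
    isempty(unknown) ||
        error("Unknown configuration keys: $(join(sort!(collect(unknown)), ", "))")
    return config
end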

@tapios (Contributor) commented May 5, 2024

Some comments on this:

  • The configuration file should just include the configuration, not a job ID (if there were an ID, it would be a config ID, which should be a reasonably clear name, not a number).
  • Having one config file per parameter value in parameter sweeps is fine. In fact, for calibration it is necessary. We should have a complete config file associated with each run. It's fine, though, to automatically manipulate config files for parameter sweeps (we typically used to do this in shell scripts, and it was a perfectly fine workflow; see the sketch after this list).
  • Cleaning up the config file options and improving error handling seems important, and we should take this on at least gradually. However, this seems to be separate from the immediate issue here, which is having output files named by the job ID and having job IDs in the config.
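
A minimal sketch of that shell-script-style sweep workflow, written in Julia here; the base config path, the swept key (dt), the values, and the output file names are purely illustrative:

import YAML

base = YAML.load_file("config/model_configs/baroclinic_wave.yml")  # assumed base config path
for dt in ("200secs", "400secs", "600secs")
    config = copy(base)
    config["dt"] = dt                                               # the single option being swept
    YAML.write_file("baroclinic_wave_dt_$(dt).yml", config)         # one complete config per value
end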

So, my suggestion is to fix the immediate issue here. I think appending the device type to a job (or config) ID just kicks the can down the road. What we used to do was have a run script that specified an output directory. If you want to generate something automatically, how about appending the system date and time to some ID that identifies the configuration and maybe the device?
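
A minimal sketch of that date/time suggestion (the config name and directory layout are illustrative; ClimaComms is assumed to be available to name the device):

import Dates
import ClimaComms

config_name = "baroclinic_wave"                             # some ID identifying the configuration
device_name = string(nameof(typeof(ClimaComms.device())))   # device type name, e.g. a CPU or CUDA device
timestamp = Dates.format(Dates.now(), "yyyy-mm-dd_HH-MM-SS")
output_dir = joinpath("output", "$(config_name)_$(device_name)_$(timestamp)")
mkpath(output_dir)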

@charleskawczynski (Member) commented May 5, 2024

> I do not understand what you mean by removing configs for perf jobs. Would you not use the standard yaml files for perf jobs? How would you specify the config through the job id?

Sorry, I wasn't clear. The perf jobs would still use (existing) yaml files. The difference would be in how they're specified. Rather than having their own separate yaml configs that we have to keep in sync with those we're trying to understand the performance of, we can instead specify a "target config" (e.g., target_config = job_id) via buildkite, and then grab that yaml using TargetJobConfig.

So, a perf job might look like julia --project=perf perf/benchmark.jl --yaml_config="implicit_baroclinic_wave" --job_id="benchmark_implicit_baroclinic_wave". And then inside benchmark.jl we can grab the config using TargetJobConfig.
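
A hypothetical sketch of what perf/benchmark.jl could do under this proposal; the directory layout, the override key, and the way the target config is resolved are assumptions rather than the actual TargetJobConfig machinery:

import YAML

target_config = "implicit_baroclinic_wave"                                  # the "target config" passed in via buildkite
config_file = joinpath("config", "model_configs", "$(target_config).yml")   # assumed layout
config = YAML.load_file(config_file)

# Layer perf-only overrides on top of the target config, so the perf job cannot drift from it.
merge!(config, Dict("job_id" => "benchmark_$(target_config)"))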

> For the particular problem at hand (needing two identical configuration files because we run one on CPU and the other on GPU, and the output_dir name is determined by job_id), the simplest workaround is to append the device type to the output_dir and get rid of one group of configurations.

One issue with this is that we'd then need variations for all compute resources (context, threads, CPU/GPU). Instead, I suggest we check whether we're on buildkite (see the sketch after this list):

  • If we are not, then just use the specified job ID
  • If we are, then use the BUILDKITE_JOB_ID. This will ensure that there cannot be collisions between any jobs running in parallel (not just in a single pipeline, but among all pipelines).
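
A minimal sketch of that check (the function and directory names are illustrative, not the actual ClimaAtmos code):

function default_output_dir(job_id)
    # BUILDKITE_JOB_ID is unique across all jobs and pipelines, so using it as the
    # output folder rules out collisions between jobs running in parallel on buildkite.
    buildkite_id = get(ENV, "BUILDKITE_JOB_ID", nothing)
    return isnothing(buildkite_id) ? joinpath("output", job_id) : joinpath("output", buildkite_id)
end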

> Other usability problems are (non-exclusive and non-ordered list): ...

I agree with @tapios on this: for the sake of making incremental improvement on this issue, I'd suggest we keep the scope narrow and open separate issues for these other items.

I think using BUILDKITE_JOB_ID is the most general solution to the problem that we have (on buildkite, which is the only way that we're currently running jobs in parallel). When we eventually run jobs on GCP, I'm sure there will be similar env variables we could use to fix this, but we can tackle that when we get there.

@tapios (Contributor) commented May 5, 2024

This sounds fine to me as a workaround for now. I'd be grateful if you implement it.

We should then have a broader discussion (again) about how we specify model configuration parameters (resolution, etc.), adjustable model parameters (ClimaParams), and details specific to a model run (which may include initial conditions and an output dir). In the end, we should be able to compose simulations by varying all of these.

@Sbozzolo (Member) commented May 6, 2024

I like the suggestion for perf (but do we need the job_id argument?).

And what you suggest for the collision can be implemented in just one line in the driver.jl :)

haskey(ENV, "BUILDKITE_JOB_ID") && (config.parsed_args["output_dir"] = ENV["BUILDKITE_JOB_ID"])

I also think this is a problem for CI only (where we use the default output_dir, which is set to job_id). Real users and calibration will submit jobs with a specified output_dir (possibly generated procedurally with scripts).

@charleskawczynski (Member)

> I like the suggestion for perf (but do we need the job_id argument?).

Yeah, we may be able to get rid of it for the perf jobs. I'll see if there are any edge cases we didn't think of.

> And what you suggest for the collision can be implemented in just one line in the driver.jl

Exactly. We can add that if it's needed, but we may not even need it since a lot of the perf jobs have their own entry scripts (e.g., perf/benchmark.jl, perf/jet.jl, etc.)

> I also think this is a problem for CI only (where we use the default output_dir, which is set to job_id). Real users and calibration will submit jobs with a specified output_dir (possibly generated procedurally with scripts).

Yes, agreed.

@Sbozzolo (Member) commented May 6, 2024

> Exactly. We can add that if it's needed, but we may not even need it since a lot of the perf jobs have their own entry scripts (e.g., perf/benchmark.jl, perf/jet.jl, etc.)

It is needed: collisions also occur outside of perf jobs. For example, we have gpu_aquaplanet_dyamond_ss_1process.yml, gpu_aquaplanet_dyamond_ss_2process.yml, and gpu_aquaplanet_dyamond_ss_4process.yml, which are identical except for the job_id (their only difference is that they are run with a different number of MPI processes).
