GPU DYAMOND run breaks down in RRTMGP Interface callbacks #2314

Closed
sriharshakandala opened this issue Oct 30, 2023 · 17 comments

Comments

@sriharshakandala
Member

sriharshakandala commented Oct 30, 2023

Currently, the GPU aquaplanet DYAMOND simulation fails in the RRTMGP callbacks.
The full error message on A100s can be found here: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/125#018b7207-7045-43ab-b6d7-d51001a69e74/138-223
The error seems to occur at the call to the instantaneous_zenith_angle function here: https://github.com/CliMA/ClimaAtmos.jl/blob/main/src/callbacks/callbacks.jl#L118

@sriharshakandala
Member Author

cc: @simonbyrne , @charleskawczynski , @tapios

@simonbyrne
Member

Looks like the error is that the interpolation objects used by Insolation.jl are not isbits:
https://github.com/CliMA/Insolation.jl/blob/f2ba81ee1e25348951f759e022145ecc8aa20c85/src/Insolation.jl#L23

I had a quick look at Insolation.jl; it seems that a bit of work might be required to make it GPU compatible. This might be a project on its own.

@simonbyrne
Member

Actually, it might just need to be refactored a bit. The function we're calling is instantaneous_zenith_angle
https://github.com/CliMA/Insolation.jl/blob/f2ba81ee1e25348951f759e022145ecc8aa20c85/src/ZenithAngleCalc.jl#L96C10-L119
at every point. However, the only spatially varying inputs are the coordinates. So what we should do is split out the call to distance_declination_hourangle (from what I can tell, its result should be the same at every point, so it could be computed on the CPU)
https://github.com/CliMA/Insolation.jl/blob/f2ba81ee1e25348951f759e022145ecc8aa20c85/src/ZenithAngleCalc.jl#L106C19-L106C49
and then pass the result as inputs to instantaneous_zenith_angle (which should then just be a bunch of trig functions)
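
A minimal Python sketch of the split proposed above, not the Insolation.jl API. The function names, the circular-orbit simplification, the obliquity value, and the time origin (midnight UTC at the vernal equinox) are all assumptions made for illustration. The spatially uniform angles are computed once per time step, and the pointwise function then needs only plain floats and trig calls.

```python
import math

SECONDS_PER_DAY = 86400.0
DAYS_PER_YEAR = 365.25
OBLIQUITY = math.radians(23.44)  # fixed orbital parameter (assumed value)


def solar_geometry(t):
    """Spatially uniform part: evaluate once per time step (e.g. on the CPU).

    t is seconds since a hypothetical reference instant: midnight UTC at the
    vernal equinox, on an assumed circular orbit.
    """
    orbit_angle = 2.0 * math.pi * t / (SECONDS_PER_DAY * DAYS_PER_YEAR)
    declination = math.asin(math.sin(OBLIQUITY) * math.sin(orbit_angle))
    # Hour angle at longitude 0: -pi at midnight, 0 at noon.
    greenwich_hour_angle = (
        2.0 * math.pi * (t % SECONDS_PER_DAY) / SECONDS_PER_DAY - math.pi
    )
    return declination, greenwich_hour_angle


def cos_zenith(lat, lon, declination, greenwich_hour_angle):
    """Pointwise part: trig on plain floats only (all angles in radians)."""
    hour_angle = greenwich_hour_angle + lon
    return (
        math.sin(lat) * math.sin(declination)
        + math.cos(lat) * math.cos(declination) * math.cos(hour_angle)
    )
```

In this arrangement only two scalars cross from the host-side setup to the pointwise evaluation, so no interpolation object has to be captured by the kernel.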

@charleskawczynski
Member

Yep, this is kind of like the interpolation functionality. We can probably write an easy workaround.

@sriharshakandala
Member Author

@tapios
Contributor

tapios commented Oct 31, 2023

The calculation of zenith angle involves two separate pieces.

  1. Determination of orbital parameters (eccentricity, obliquity, and longitude of perihelion). These vary on timescales of O(10,000 years), i.e., these are relevant for deeper-time simulations. We should take these parameters to be fixed for what we need now.
  2. Given orbital parameters, latitude, longitude, and time, calculate the zenith angle. The zenith angle is just a trigonometric function that depends parametrically on orbital parameters f(latitude, longitude, time; orbital parameters).

If we take orbital parameters to be fixed (which is what we should do), no interpolations are required.

The calculation in 2) uses auxiliary angles that appear as arguments in trigonometric functions:

  • Hour angle(longitude, time): This measures local solar time as an angle. It only depends on longitude and time.
  • Declination angle(time): This measures the latitude of the subsolar point. It only depends on time.

These are all simple trig function evaluations. It's probably easiest to re-evaluate hour angle and declination angle every time they are needed (rather than doing global calculations and transferring data).

In short, just call insolation with fixed orbital parameters, and the rest is a pointwise function evaluation that should be straightforward on GPUs.
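
For reference, the pointwise relation described here is usually written as below, with $\theta$ the zenith angle, $\phi$ latitude, $\lambda$ longitude, $\delta$ the declination angle, and $\eta$ the hour angle. This is the standard textbook form, given as an assumption about what the code evaluates and not as a transcription of Insolation.jl.

```latex
\cos\theta(\phi, \lambda, t)
  = \sin\phi \, \sin\delta(t)
  + \cos\phi \, \cos\delta(t) \, \cos\eta(\lambda, t)
```

With the orbital parameters held fixed, $\delta$ depends only on time and $\eta$ only on longitude and time, so the right-hand side can be evaluated independently at each grid point.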

@sriharshakandala
Member Author

sriharshakandala commented Nov 6, 2023

print_to_string statements are incompatible with GPU code (e.g., https://github.com/CliMA/Insolation.jl/blob/main/src/ZenithAngleCalc.jl#L54 ).
The function interface is modified to pass in date0 as an argument. This allows date0 to be set on the CPU.
CliMA/Insolation.jl#60
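
A small Python illustration of the idea of passing date0 in as an argument, not the actual Insolation.jl change. The function name and the epoch value are hypothetical. The reference date is built once on the host, and only a plain float reaches the pointwise code.

```python
from datetime import datetime, timezone


def seconds_since_epoch(date, date0):
    """Host-side conversion of a date to a plain float offset from date0."""
    return (date - date0).total_seconds()


# date0 is constructed once on the host and passed in as an argument,
# so no date parsing or string handling happens in the pointwise code.
date0 = datetime(2000, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # hypothetical epoch
date = datetime(2000, 1, 2, 12, 0, 0, tzinfo=timezone.utc)
t = seconds_since_epoch(date, date0)  # float seconds, usable in device code
```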

@sriharshakandala
Member Author

I tested ClimaAtmos locally with this branch. The simulation is running!

@sriharshakandala
Member Author

It runs to completion on the GPU pipeline here: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/138#018bad4c-6dfa-42fa-89fc-48292ca85d8b
PR CliMA/Insolation.jl#60 fixes the issue.
The corresponding ClimaAtmos PR is at #2341

@szy21
Member

szy21 commented Nov 8, 2023

@akshaysridhar Could you check if the TOA shortwave flux in the above pipeline looks correct (it only has hdf5 files)? Thanks!

@akshaysridhar
Member

akshaysridhar commented Nov 8, 2023

Build ID: GPU-pipeline 138
Branch: sk/update_insolation
Commit Hash: 8aca193
Purpose: Insolation interface check.
Output: Day1.0.hdf5 (offline remap)

[Image: Screenshot 2023-11-08 at 8 43 34 AM]

@charleskawczynski
Member

Hm, this is running at less than 1 sypd: sypd: 0.4684889732942239. I can't seem to open the nsight report:

[Image: Screen Shot 2023-11-08 at 8 53 13 AM]

@sriharshakandala, @simonbyrne, have you seen this error before? My nsight systems is up to date, but the error does seem version related. Do I need to downgrade for some reason?

@sriharshakandala
Member Author

> Hm, this is running at less than 1 sypd: sypd: 0.4684889732942239. I can't seem to open the nsight report:
>
> [Image: Screen Shot 2023-11-08 at 8 53 13 AM]
>
> @sriharshakandala, @simonbyrne, have you seen this error before? My nsight systems is up to date, but the error does seem version related. Do I need to downgrade for some reason?

Yes, we noticed this before. The latest macOS version of Nsight Systems is a bit behind the latest Linux version. I believe we restricted the Linux version!

@szy21
Member

szy21 commented Nov 8, 2023

Looks good, thanks @akshaysridhar!

@simonbyrne
Member

> Hm, this is running at less than 1 sypd: sypd: 0.4684889732942239. I can't seem to open the nsight report:

See
https://forums.developer.nvidia.com/t/nsight-systems-2023-3-3-macos-host-available/270948?u=simonbyrne1

The easiest fix is to module load nsight-systems/2023.3.1, which will pick up the compatible version of nsight.

@charleskawczynski
Member

This is apparently not fixed in the latest update. See build here: https://buildkite.com/clima/climaatmos-ci/builds/16188#018d46aa-43db-4dcb-a2d6-9a20c87d6cf9/163-244

@charleskawczynski
Member

This was fixed in a recent RRTMGP update, and the test is now strictly enforced.
