Remove Float32 to Float64 conversions in RRTMGP Interface #2635

Closed
sriharshakandala opened this issue Feb 6, 2024 · 30 comments · Fixed by #2704
@sriharshakandala (Member):

Remove the Float32 to Float64 conversions (and back) in the RRTMGP interface.
This lets us use RRTMGP's faster single-precision calculations for Float32 runs.
Preliminary RRTMGP.jl timing tests indicate about a 27% reduction in runtime compared with the Float64 runs.
There could be additional savings from eliminating the conversions.
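
For context, here is a minimal sketch of the kind of round-trip conversion this issue targets (illustrative only; the names are hypothetical, not the actual ClimaAtmos/RRTMGP.jl interface code):

```julia
# Illustrative sketch only; not the actual interface code.
FT = Float32
atmos_temp = rand(FT, 71, 128)          # Float32 state from the atmosphere model

# Current pattern: promote to Float64 for the radiation call, then demote again.
rad_input    = Float64.(atmos_temp)     # allocation + copy on every radiation call
rad_output   = 2 .* rad_input           # stand-in for the Float64 RRTMGP solve
atmos_fluxes = FT.(rad_output)          # convert back to Float32

# Target pattern: pass the Float32 fields through so RRTMGP runs natively in single precision.
atmos_fluxes_native = 2 .* atmos_temp
```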

@charleskawczynski (Member):

Unfortunately, a bunch of simulations are not stable at Float32: https://buildkite.com/clima/climaatmos-ci/builds/16548.

@sriharshakandala (Member Author) commented Feb 6, 2024:

> https://buildkite.com/clima/climaatmos-ci/builds/16548

The problem seems to be with saturation adjustment! Is it possible to tune the tolerances/maxiter for single precision? It may not make sense to use higher precision just for the radiation component.

cc: @akshaysridhar, @szy21
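
(One way to make the saturation-adjustment settings precision-aware is sketched below; the function names and values are hypothetical, not the actual Thermodynamics.jl/ClimaAtmos API.)

```julia
# Hypothetical sketch of precision-dependent saturation-adjustment settings;
# not the actual Thermodynamics.jl / ClimaAtmos configuration.
sa_temperature_tol(::Type{Float64}) = 1e-4   # K
sa_temperature_tol(::Type{Float32}) = 1f-2   # looser tolerance at single precision
sa_maxiter(::Type{Float64}) = 10
sa_maxiter(::Type{Float32}) = 20             # allow a few more iterations if needed

FT = Float32
@show sa_temperature_tol(FT) sa_maxiter(FT)
```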

@szy21 (Member) commented Feb 7, 2024:

The real problem is that the simulations are not stable, not saturation adjustment (the input to SA is wrong). If you are sure the single-precision change is correct, you can try reducing dt for the unstable jobs. gpu_aquaplanet_dyamond seems to break immediately, so it would be good to double-check that the change is done correctly.

@sriharshakandala (Member Author):

> The real problem is that the simulations are not stable, not saturation adjustment (the input to SA is wrong). If you are sure the single-precision change is correct, you can try reducing dt for the unstable jobs. gpu_aquaplanet_dyamond seems to break immediately, so it would be good to double-check that the change is done correctly.

Our tests (e.g., https://buildkite.com/clima/rrtmgp-ci/builds/562#018d7f62-627e-4bb4-b9f9-107f08405aa5 contains the latest build) indicate that single-precision runs produce results that are very close to those of the double-precision runs, in both relative and absolute terms!

@szy21 (Member) commented Feb 7, 2024:

OK, try reducing dt a bit then. I would be surprised if a small change in radiation resulted in instability, though, so let's double-check whether you need to reduce dt a lot to make it stable.

@charleskawczynski (Member):

> gpu_aquaplanet_dyamond seems to break immediately, so it would be good to double-check that the change is done correctly.

I noticed that too. It's a bit curious. But, as you noted, thermodynamics is already running and working in Float32, so I'm wondering if we're somehow exercising RRTMGP in ways that are not exercised in RRTMGP.jl's test suite.

@szy21 (Member) commented Feb 8, 2024:

Actually, single_column_radiative_equilibrium_allsky_idealized_clouds is unstable too. The plots are empty, so it probably finished with all NaNs. This is a bit concerning. It seems to me the clear-sky radiation is fine, so maybe there is some issue with the all-sky radiation.

(It would be good to catch this without plotting)

@LenkaNovak (Contributor):

We recently noticed in the coupler that, after CA v0.20, our coarse simulations are less stable as well, especially those with all-sky radiation (we thought it was a function in TD that changed, but testing showed that wasn't it).

@szy21 (Member) commented Feb 8, 2024:

> We recently noticed in the coupler that, after CA v0.20, our coarse simulations are less stable as well, especially those with all-sky radiation (we thought it was a function in TD that changed, but testing showed that wasn't it).

That is probably because we changed from a 0-or-1 cloud fraction to a partial cloud fraction.

@LenkaNovak (Contributor):

> That is probably because we changed from a 0-or-1 cloud fraction to a partial cloud fraction.

Why would this lead to less stable sims though? 🤔

@szy21 (Member) commented Feb 8, 2024:

> Why would this lead to less stable sims though? 🤔

It shouldn't, really; it just results in a significant change in the results. :) You can test it by setting cloud_model: "grid_scale".

@szy21 (Member) commented Feb 8, 2024:

Actually, this reminds me that I should change gpu_aquaplanet_dyamond to cloud_model: "grid_scale". I will do that after we sort out the instability issue.

@LenkaNovak (Contributor):

> It shouldn't, really; it just results in a significant change in the results. :) You can test it by setting cloud_model: "grid_scale".

cloud_model: "grid_scale" doesn't fix our coupler issue, unfortunately. For now we're resorting to gray radiation (since this only breaks for the coarse setups in our tests), but it would be good to understand this more. Did anything else (that's not reverted by cloud_model: "grid_scale") change in radiation recently?

@szy21 (Member) commented Feb 8, 2024:

> cloud_model: "grid_scale" doesn't fix our coupler issue, unfortunately. For now we're resorting to gray radiation (since this only breaks for the coarse setups in our tests), but it would be good to understand this more. Did anything else (that's not reverted by cloud_model: "grid_scale") change in radiation recently?

Nothing else should have physical changes, but the recent refactor of RRTMGP results in numerical changes (all atmos CI and longruns passed, though).

(Also, this issue/PR should not affect you, as we are still using Float64 for radiation in the latest atmos release, in case that's confusing.)

@szy21 (Member) commented Feb 8, 2024:

We haven't tested a coarse-resolution AMIP run for more than 1 day, though, as that's not our focus, so it could be that the atmos run is not stable as well.

@szy21 (Member) commented Feb 11, 2024:

It seems there are some edge cases where the SW fluxes are wrong with Float32. When I run single_column_radiative_equilibrium_allsky_idealized_clouds locally, the simulation breaks even with smaller timesteps. With dt: 1hours, just before the simulation breaks I get:

```julia
julia> p.radiation.radiation_model.face_sw_flux_dn
71×1 view(::Matrix{Float32}, 1:71, :) with eltype Float32:
 2.7108308f-5
 2.7150347f-5
 2.7194521f-5
 2.7240849f-5
 2.7289268f-5
 2.733982f-5
 2.7392463f-5
 ⋮
 4.6085464f10
 4.620472f10
 4.626888f10
 4.6299476f10
 4.6311473f10
 4.631403f10

julia> p.radiation.radiation_model.face_sw_flux_up
71×1 view(::Matrix{Float32}, 1:71, :) with eltype Float32:
      1.0301155f-5
      1.0301945f-5
      1.030292f-5
      1.0304107f-5
      1.0305519f-5
      1.03071725f-5
      1.0309083f-5
      ⋮
 388453.97
 401662.0
 412144.22
 419960.9
 425306.47
      2.2088827f18

julia> p.radiation.radiation_model.weighted_irradiance
1-element Vector{Float32}:
 1408.6777

julia> p.radiation.radiation_model.cos_zenith
1-element Vector{Float32}:
 7.54979f-8
```

So something is clearly wrong. Maybe it has something to do with some threshold in RRTMGP? @sriharshakandala any idea?

@szy21 (Member) commented Feb 11, 2024:

Maybe @tapios knows if any threshold in RRTMGP could lead to errors in SW fluxes? The problem seems to happen when cos_zenith_angle is very small. In ClimaAtmos we limit the zenith angle by `max_zenith_angle = FT(π) / 2 - eps(FT)`.

@szy21 (Member) commented Feb 11, 2024:

OK, I don't think it has anything to do with clear-sky vs. all-sky. When I run a single-column clear-sky case locally with time-varying insolation and Float32, it breaks too. It seems to me it has something to do with errors in the SW fluxes when the zenith angle is large: Float32 leads to very wrong SW fluxes. (When the zenith angle is larger than 90 degrees, cos_zenith_angle is 7.54979f-8 for Float32 and 2.83276944882399e-16 for Float64, from the limit above, if that helps.)

It's not only the zenith angle, though. When we start with a large zenith angle, the simulation doesn't break immediately, so it has something to do with the atmospheric state as well.
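
(The cos values quoted above follow directly from the ClimaAtmos limit; a small REPL sketch, with the approximate values as reported in this thread:)

```julia
# The zenith-angle limit used in ClimaAtmos, evaluated at both precisions.
max_zenith_angle(FT) = FT(π) / 2 - eps(FT)

cos(max_zenith_angle(Float32))   # ≈ 7.5f-8   (value reported above)
cos(max_zenith_angle(Float64))   # ≈ 2.8e-16  (value reported above)
```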

@tapios (Contributor) commented Feb 11, 2024:

Don't we have some catch that sets cos(SZA) = 0 when abs(SZA) > pi/2? If not, let's add it.

Can you please point me to the relevant pieces of code?

@tapios (Contributor) commented Feb 11, 2024:

We should add a conditional here to make sure all fluxes are zero when zenith_angle > pi/2.

I am concerned that with our current -eps construct we run into problems for fluxes such as here, where we have an exponential exp(-τ_sum / μ₀) that leads to wrong evaluations at Float32.

This may not be the only problem, but it could be one problem.

Also, the limiting of the zenith angle should not live in callbacks in the atmosphere model, but in RRTMGP. Basically, we should avoid the entire loop over spectral bands when abs(zenith_angle) >= pi/2.
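
(A minimal sketch of the kind of guard proposed above, with hypothetical function and argument names; the actual RRTMGP.jl solver entry points differ:)

```julia
# Hypothetical sketch of guarding the SW solve; not the actual RRTMGP.jl code.
function solve_sw!(flux_up::AbstractVector{FT}, flux_dn::AbstractVector{FT},
                   cos_zenith::FT, n_bands::Integer) where {FT}
    if cos_zenith <= FT(0)      # zenith angle ≥ π/2: sun at or below the horizon
        fill!(flux_up, FT(0))   # all SW fluxes are zero; the band loop is skipped entirely
        fill!(flux_dn, FT(0))
        return nothing
    end
    for band in 1:n_bands
        # ... per-band two-stream solve, with terms like exp(-τ_sum / cos_zenith) ...
    end
    return nothing
end
```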

@szy21 (Member) commented Feb 12, 2024:

@tapios Thank you! I did a quick hack in ClimaAtmos to set all SW fluxes to zero when the zenith angle is large, and that seems to solve the problem: see 9679498 (the hard-coded numbers are there to match our current -eps threshold). I haven't figured out which lines in RRTMGP cause the issue (exp(-τ_sum / μ₀) seems fine), but @sriharshakandala, let's add this to RRTMGP?

@szy21 (Member) commented Feb 12, 2024:

Also, this doesn't show up in the RRTMGP test cases. I think we have test cases with a large zenith angle, but maybe the problem is caused by some combination of optical depth and zenith angle. Is there a way to test this in RRTMGP so we won't have similar issues in the future? Maybe we can add the edge cases we found in ClimaAtmos to the RRTMGP test cases?

@sriharshakandala (Member Author) commented Feb 12, 2024:

> Also, this doesn't show up in the RRTMGP test cases. I think we have test cases with a large zenith angle, but maybe the problem is caused by some combination of optical depth and zenith angle. Is there a way to test this in RRTMGP so we won't have similar issues in the future? Maybe we can add the edge cases we found in ClimaAtmos to the RRTMGP test cases?

In RRTMGP, this is handled upstream in the tests, e.g. here: https://github.com/CliMA/RRTMGP.jl/blob/main/test/read_rfmip_clear_sky.jl#L38 for the clear-sky case.
For the all-sky case, we just take one column with a specific zenith angle and repeat it. That's the reason we do not explicitly have this check in that test.

@szy21 (Member) commented Feb 12, 2024:

It might be this line. I wonder whether, besides setting the SW fluxes to zero, we should also add a minimum value for the denominator (`FT(1) - Rdif * albedo_ilev`).
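
(A guarded version of that kind of denominator might look like the sketch below; the variable names are taken from the expression above, and the floor value is purely illustrative, not the actual RRTMGP.jl code:)

```julia
# Hypothetical sketch: keep the two-stream denominator away from zero so the
# division stays well-behaved at Float32. Not the actual RRTMGP.jl code.
function safe_denominator(Rdif::FT, albedo_ilev::FT) where {FT}
    denom = FT(1) - Rdif * albedo_ilev
    return max(denom, sqrt(eps(FT)))   # floor chosen for illustration only
end
```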

@szy21 (Member) commented Feb 13, 2024:

I did some more tests on this. With Float32 there are issues even with `cos_zenith = 1.2e-4`, which corresponds to `cos(pi/2 - 1000*eps(FT))`. There seem to be no issues with `cos_zenith = 3.5e-4`, which corresponds to `cos(pi/2 - sqrt(eps(FT)))`. So we may need to set the SW fluxes to zero when the zenith angle is larger than `pi/2 - sqrt(eps(FT))`. Right now `pi/2 - eps(FT)` works because we use a single column and call radiation every hour, so we don't get a zenith angle that is smaller than but very close to pi/2. But we may get one on the sphere when calling radiation more frequently. I think it would be good to figure out what the issue is. Or are we OK with `pi/2 - sqrt(eps(FT))`? @tapios @sriharshakandala What do you think?
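
(The two thresholds can be checked directly in the REPL; approximate values as quoted above:)

```julia
FT = Float32
cos(FT(π) / 2 - 1000 * eps(FT))   # ≈ 1.2f-4  (still breaks, per the tests above)
cos(FT(π) / 2 - sqrt(eps(FT)))    # ≈ 3.5f-4  (no issues, per the tests above)
```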

@tapios (Contributor) commented Feb 13, 2024:

I'm fine with having `pi/2 - sqrt(eps(FT))` as the limit, but it would be nice to trace down where the problem occurs. Can you do it in the REPL, e.g., assuming zero SW opacity? Or does it depend on the SW optical depth?

@szy21 (Member) commented Feb 13, 2024:

> I'm fine with having `pi/2 - sqrt(eps(FT))` as the limit, but it would be nice to trace down where the problem occurs. Can you do it in the REPL, e.g., assuming zero SW opacity? Or does it depend on the SW optical depth?

It does depend on the SW optical depth: for the same zenith angle, it sometimes breaks and sometimes doesn't. I will see if I can try modifying the SW optical properties.

@szy21 (Member) commented Feb 14, 2024:

We found that it's because this line can be negative with both Float32 and Float64, but it's small enough that it didn't cause issues with Float64. Limiting it to zero works fine in standalone RRTMGP cases. We will test ClimaAtmos with this fix in RRTMGP.
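
(The fix amounts to clamping that quantity at zero; a brief illustration of the pattern with a hypothetical variable, since the actual line lives in RRTMGP.jl:)

```julia
# Hypothetical illustration of the fix pattern; the actual quantity is in RRTMGP.jl.
FT = Float32
value = -1f-7                   # slightly negative from round-off, as described above
value = max(value, zero(FT))    # clamp to zero
```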

@sriharshakandala (Member Author):

> We found that it's because this line can be negative with both Float32 and Float64, but it's small enough that it didn't cause issues with Float64. Limiting it to zero works fine in standalone RRTMGP cases. We will test ClimaAtmos with this fix in RRTMGP.

Thanks @szy21. I just added this change in PR CliMA/RRTMGP.jl#448.

@nefrathenrici linked a pull request on Feb 22, 2024 that will close this issue.