Remove Float32 to Float64 conversions in RRTMGP Interface #2635

Closed
sriharshakandala opened this issue Feb 6, 2024 · 30 comments · Fixed by #2704
@sriharshakandala (Member):

Remove the Float32 to Float64 conversions (and back) in the RRTMGP interface.
This lets us use RRTMGP's faster single-precision calculations for Float32 runs.
Preliminary RRTMGP.jl timing tests indicate about a 27% reduction in runtime compared with the Float64 runs.
There could be additional savings from eliminating the conversions.
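
For context, here is a minimal sketch of the kind of round-trip conversion this issue targets (illustrative only; the names are hypothetical, not the actual ClimaAtmos/RRTMGP.jl interface code):

```julia
# Illustrative sketch only; not the actual interface code.
FT = Float32
atmos_temp = rand(FT, 71, 128)          # Float32 state from the atmosphere model

# Current pattern: promote to Float64 for the radiation call, then demote again.
rad_input    = Float64.(atmos_temp)     # allocation + copy on every radiation call
rad_output   = 2 .* rad_input           # stand-in for the Float64 RRTMGP solve
atmos_fluxes = FT.(rad_output)          # convert back to Float32

# Target pattern: pass the Float32 fields through so RRTMGP runs natively in single precision.
atmos_fluxes_native = 2 .* atmos_temp
```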

@charleskawczynski (Member):

Unfortunately, a bunch of simulations are not stable at Float32: https://buildkite.com/clima/climaatmos-ci/builds/16548.

@sriharshakandala (Member Author) commented Feb 6, 2024:

> https://buildkite.com/clima/climaatmos-ci/builds/16548

The problem seems to be with saturation adjustment! Is it possible to tune the tolerances/maxiter for single precision? It may not make sense to use higher precision just for the radiation component.

cc: @akshaysridhar, @szy21
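
(One way to make the saturation-adjustment settings precision-aware is sketched below; the function names and values are hypothetical, not the actual Thermodynamics.jl/ClimaAtmos API.)

```julia
# Hypothetical sketch of precision-dependent saturation-adjustment settings;
# not the actual Thermodynamics.jl / ClimaAtmos configuration.
sa_temperature_tol(::Type{Float64}) = 1e-4   # K
sa_temperature_tol(::Type{Float32}) = 1f-2   # looser tolerance at single precision
sa_maxiter(::Type{Float64}) = 10
sa_maxiter(::Type{Float32}) = 20             # allow a few more iterations if needed

FT = Float32
@show sa_temperature_tol(FT) sa_maxiter(FT)
```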

@szy21 (Member) commented Feb 7, 2024:

The real problem is that the simulations are not stable, not saturation adjustment (the input to SA is wrong). If you are sure the single-precision change is correct, you can try reducing dt for the unstable jobs. gpu_aquaplanet_dyamond seems to break immediately, so it would be good to double-check that the change is done correctly.

@sriharshakandala (Member Author):

> The real problem is that the simulations are not stable, not saturation adjustment (the input to SA is wrong). If you are sure the single-precision change is correct, you can try reducing dt for the unstable jobs. gpu_aquaplanet_dyamond seems to break immediately, so it would be good to double-check that the change is done correctly.

Our tests (e.g., https://buildkite.com/clima/rrtmgp-ci/builds/562#018d7f62-627e-4bb4-b9f9-107f08405aa5 contains the latest build) indicate that single-precision runs produce results that are very close to those of the double-precision runs, in both relative and absolute terms!

@szy21 (Member) commented Feb 7, 2024:

OK, try reducing dt a bit then. I would be surprised if a small change in radiation resulted in instability, though, so let's double-check whether you need to reduce dt a lot to make it stable.

@charleskawczynski (Member):

> gpu_aquaplanet_dyamond seems to break immediately, so it would be good to double-check that the change is done correctly.

I noticed that too. It's a bit curious. But, as you noted, thermodynamics is already running and working in Float32, so I'm wondering if we're somehow exercising RRTMGP in ways that are not exercised in RRTMGP.jl's test suite.

@szy21 (Member) commented Feb 8, 2024:

Actually, single_column_radiative_equilibrium_allsky_idealized_clouds is unstable too. The plots are empty, so it probably finished with all NaNs. This is a bit concerning. It seems to me the clear-sky radiation is fine, so maybe there is some issue with the all-sky radiation.

(It would be good to catch this without plotting)

@LenkaNovak (Contributor):

We recently noticed in the coupler that, after CA v0.20, our coarse simulations are less stable as well, especially those with all-sky radiation (we thought it was a function in TD that changed, but testing showed that wasn't it).

@szy21 (Member) commented Feb 8, 2024:

> We recently noticed in the coupler that, after CA v0.20, our coarse simulations are less stable as well, especially those with all-sky radiation (we thought it was a function in TD that changed, but testing showed that wasn't it).

That is probably because we changed from a 0-or-1 cloud fraction to a partial cloud fraction.

@LenkaNovak (Contributor):

> That is probably because we changed from a 0-or-1 cloud fraction to a partial cloud fraction.

Why would this lead to less stable sims though? 🤔

@szy21 (Member) commented Feb 8, 2024:

> Why would this lead to less stable sims though? 🤔

It shouldn't, really; it just results in a significant change in the results. :) You can test it by setting cloud_model: "grid_scale".

@szy21 (Member) commented Feb 8, 2024:

Actually, this reminds me that I should change gpu_aquaplanet_dyamond to cloud_model: "grid_scale". I will do that after we sort out the instability issue.

@LenkaNovak (Contributor):

> It shouldn't, really; it just results in a significant change in the results. :) You can test it by setting cloud_model: "grid_scale".

cloud_model: "grid_scale" doesn't fix our coupler issue, unfortunately. For now we're resorting to gray radiation (since this only breaks for the coarse setups in our tests), but it would be good to understand this more. Did anything else (that's not reverted by cloud_model: "grid_scale") change in radiation recently?

@szy21 (Member) commented Feb 8, 2024:

> cloud_model: "grid_scale" doesn't fix our coupler issue, unfortunately. For now we're resorting to gray radiation (since this only breaks for the coarse setups in our tests), but it would be good to understand this more. Did anything else (that's not reverted by cloud_model: "grid_scale") change in radiation recently?

Nothing else should have physical changes, but the recent refactor of RRTMGP results in numerical changes (all atmos CI and longruns passed, though).

(Also, this issue/PR should not affect you, as we are still using Float64 for radiation in the latest atmos release, in case that's confusing.)

@szy21 (Member) commented Feb 8, 2024:

We haven't tested a coarse-resolution AMIP run for more than 1 day, though, as that's not our focus, so it could be that the atmos run is not stable as well.

@szy21 (Member) commented Feb 11, 2024:

It seems there are some edge cases where the SW fluxes are wrong with Float32. When I run single_column_radiative_equilibrium_allsky_idealized_clouds locally, the simulation breaks even with smaller timesteps. With dt: 1hours, just before the simulation breaks I get:

```julia
julia> p.radiation.radiation_model.face_sw_flux_dn
71×1 view(::Matrix{Float32}, 1:71, :) with eltype Float32:
 2.7108308f-5
 2.7150347f-5
 2.7194521f-5
 2.7240849f-5
 2.7289268f-5
 2.733982f-5
 2.7392463f-5
 ⋮
 4.6085464f10
 4.620472f10
 4.626888f10
 4.6299476f10
 4.6311473f10
 4.631403f10

julia> p.radiation.radiation_model.face_sw_flux_up
71×1 view(::Matrix{Float32}, 1:71, :) with eltype Float32:
      1.0301155f-5
      1.0301945f-5
      1.030292f-5
      1.0304107f-5
      1.0305519f-5
      1.03071725f-5
      1.0309083f-5
      ⋮
 388453.97
 401662.0
 412144.22
 419960.9
 425306.47
      2.2088827f18

julia> p.radiation.radiation_model.weighted_irradiance
1-element Vector{Float32}:
 1408.6777

julia> p.radiation.radiation_model.cos_zenith
1-element Vector{Float32}:
 7.54979f-8
```

So something is clearly wrong. Maybe it has something to do with some threshold in RRTMGP? @sriharshakandala any idea?

@szy21 (Member) commented Feb 11, 2024:

Maybe @tapios knows if any threshold in RRTMGP could lead to errors in SW fluxes? The problem seems to happen when cos_zenith_angle is very small. In ClimaAtmos we limit the zenith angle by `max_zenith_angle = FT(π) / 2 - eps(FT)`.

@szy21 (Member) commented Feb 11, 2024:

OK, I don't think it has anything to do with clear-sky vs. all-sky. When I run a single-column clear-sky case locally with time-varying insolation and Float32, it breaks too. It seems to me it has something to do with errors in the SW fluxes when the zenith angle is large: Float32 leads to very wrong SW fluxes. (When the zenith angle is larger than 90 degrees, cos_zenith_angle is 7.54979f-8 for Float32 and 2.83276944882399e-16 for Float64, from the limit above, if that helps.)

It's not only the zenith angle, though. When we start with a large zenith angle, the simulation doesn't break immediately, so it has something to do with the atmospheric state as well.
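
(The cos values quoted above follow directly from the ClimaAtmos limit; a small REPL sketch, with the approximate values as reported in this thread:)

```julia
# The zenith-angle limit used in ClimaAtmos, evaluated at both precisions.
max_zenith_angle(FT) = FT(π) / 2 - eps(FT)

cos(max_zenith_angle(Float32))   # ≈ 7.5f-8   (value reported above)
cos(max_zenith_angle(Float64))   # ≈ 2.8e-16  (value reported above)
```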

@tapios (Contributor) commented Feb 11, 2024:

Don't we have some catch that sets cos(SZA) = 0 when abs(SZA) > pi/2? If not, let's add it.

Can you please point me to the relevant pieces of code?

@tapios (Contributor) commented Feb 11, 2024:

We should add a conditional here to make sure all fluxes are zero when zenith_angle > pi/2.

I am concerned that with our current -eps construct we run into problems for fluxes such as here, where we have an exponential exp(-τ_sum / μ₀) that leads to wrong evaluations at Float32.

This may not be the only problem, but it could be one problem.

Also, the limiting of the zenith angle should not live in callbacks in the atmosphere model, but in RRTMGP. Basically, we should avoid the entire loop over spectral bands when abs(zenith_angle) >= pi/2.
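
(A minimal sketch of the kind of guard proposed above, with hypothetical function and argument names; the actual RRTMGP.jl solver entry points differ:)

```julia
# Hypothetical sketch of guarding the SW solve; not the actual RRTMGP.jl code.
function solve_sw!(flux_up::AbstractVector{FT}, flux_dn::AbstractVector{FT},
                   cos_zenith::FT, n_bands::Integer) where {FT}
    if cos_zenith <= FT(0)      # zenith angle ≥ π/2: sun at or below the horizon
        fill!(flux_up, FT(0))   # all SW fluxes are zero; the band loop is skipped entirely
        fill!(flux_dn, FT(0))
        return nothing
    end
    for band in 1:n_bands
        # ... per-band two-stream solve, with terms like exp(-τ_sum / cos_zenith) ...
    end
    return nothing
end
```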

@szy21 (Member) commented Feb 12, 2024:

@tapios Thank you! I did a quick hack in ClimaAtmos to set all SW fluxes to zero when the zenith angle is large, and that seems to solve the problem: see 9679498 (the hard-coded numbers are there to match our current -eps threshold). I haven't figured out which lines in RRTMGP cause the issue (exp(-τ_sum / μ₀) seems fine), but @sriharshakandala, let's add this to RRTMGP?

@szy21 (Member) commented Feb 12, 2024:

Also, this doesn't show up in the RRTMGP test cases. I think we have test cases with a large zenith angle, but maybe the problem is caused by some combination of optical depth and zenith angle. Is there a way to test this in RRTMGP so we won't have similar issues in the future? Maybe we can add the edge cases we found in ClimaAtmos to the RRTMGP test cases?

@sriharshakandala (Member Author) commented Feb 12, 2024:

> Also, this doesn't show up in the RRTMGP test cases. I think we have test cases with a large zenith angle, but maybe the problem is caused by some combination of optical depth and zenith angle. Is there a way to test this in RRTMGP so we won't have similar issues in the future? Maybe we can add the edge cases we found in ClimaAtmos to the RRTMGP test cases?

In RRTMGP, this is handled upstream in the tests, e.g. here: https://github.com/CliMA/RRTMGP.jl/blob/main/test/read_rfmip_clear_sky.jl#L38 for the clear-sky case.
For the all-sky case, we just take one column with a specific zenith angle and repeat it. That's the reason we do not explicitly have this check in that test.

@szy21 (Member) commented Feb 12, 2024:

It might be this line. I wonder whether, besides setting the SW fluxes to zero, we should also add a minimum value for the denominator (`FT(1) - Rdif * albedo_ilev`).
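
(A guarded version of that kind of denominator might look like the sketch below; the variable names are taken from the expression above, and the floor value is purely illustrative, not the actual RRTMGP.jl code:)

```julia
# Hypothetical sketch: keep the two-stream denominator away from zero so the
# division stays well-behaved at Float32. Not the actual RRTMGP.jl code.
function safe_denominator(Rdif::FT, albedo_ilev::FT) where {FT}
    denom = FT(1) - Rdif * albedo_ilev
    return max(denom, sqrt(eps(FT)))   # floor chosen for illustration only
end
```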

@szy21 (Member) commented Feb 13, 2024:

I did some more tests on this. With Float32 there are issues even with `cos_zenith = 1.2e-4`, which corresponds to `cos(pi/2 - 1000*eps(FT))`. There seem to be no issues with `cos_zenith = 3.5e-4`, which corresponds to `cos(pi/2 - sqrt(eps(FT)))`. So we may need to set the SW fluxes to zero when the zenith angle is larger than `pi/2 - sqrt(eps(FT))`. Right now `pi/2 - eps(FT)` works because we use a single column and call radiation every hour, so we don't get a zenith angle that is smaller than but very close to pi/2. But we may get one on the sphere when calling radiation more frequently. I think it would be good to figure out what the issue is. Or are we OK with `pi/2 - sqrt(eps(FT))`? @tapios @sriharshakandala What do you think?
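
(The two thresholds can be checked directly in the REPL; approximate values as quoted above:)

```julia
FT = Float32
cos(FT(π) / 2 - 1000 * eps(FT))   # ≈ 1.2f-4  (still breaks, per the tests above)
cos(FT(π) / 2 - sqrt(eps(FT)))    # ≈ 3.5f-4  (no issues, per the tests above)
```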

@tapios (Contributor) commented Feb 13, 2024:

I'm fine with having `pi/2 - sqrt(eps(FT))` as the limit, but it would be nice to trace down where the problem occurs. Can you do it in the REPL, e.g., assuming zero SW opacity? Or does it depend on the SW optical depth?

@szy21 (Member) commented Feb 13, 2024:

> I'm fine with having `pi/2 - sqrt(eps(FT))` as the limit, but it would be nice to trace down where the problem occurs. Can you do it in the REPL, e.g., assuming zero SW opacity? Or does it depend on the SW optical depth?

It does depend on the SW optical depth: for the same zenith angle, it sometimes breaks and sometimes doesn't. I will see if I can try modifying the SW optical properties.

@szy21 (Member) commented Feb 14, 2024:

We found that it's because this line can be negative with both Float32 and Float64, but it's small enough that it didn't cause issues with Float64. Limiting it to zero works fine in standalone RRTMGP cases. We will test ClimaAtmos with this fix in RRTMGP.
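
(The fix amounts to clamping that quantity at zero; a brief illustration of the pattern with a hypothetical variable, since the actual line lives in RRTMGP.jl:)

```julia
# Hypothetical illustration of the fix pattern; the actual quantity is in RRTMGP.jl.
FT = Float32
value = -1f-7                   # slightly negative from round-off, as described above
value = max(value, zero(FT))    # clamp to zero
```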

@sriharshakandala (Member Author):

> We found that it's because this line can be negative with both Float32 and Float64, but it's small enough that it didn't cause issues with Float64. Limiting it to zero works fine in standalone RRTMGP cases. We will test ClimaAtmos with this fix in RRTMGP.

Thanks @szy21. I just added this change in PR CliMA/RRTMGP.jl#448.

@nefrathenrici linked a pull request on Feb 22, 2024 that will close this issue.