
Conversation

@treigerm
Member

Prepare the data now so that datasets without the "-tiny" suffix are the full datasets for our benchmark. The "-tiny" datasets can then be used mainly for debugging.

Here's a high-level list of what is included:

  • CAMS: 1 day (2023-06-15), all pressure levels
  • CMIP6 temperature: 1 year (2020), all pressure levels
  • CMIP6 SST: 1 year (2020)
  • ERA5: 1 day (2020-03-01)
  • Biomass: 1 year (2020, corresponding to a single time step), spatially restricted to a bounding box over mainland France (chosen based on a discussion with @milankl, who said that France contains diverse land cover; see the sketch after this list); the full Biomass snapshot is around 20 GB.
  • NextGEMS: 1 day (2020-03-01)
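
To make the subsetting above concrete, here is a minimal sketch of selecting a single day and a bounding box with xarray; the store paths, coordinate names, and the exact bounding box over France are assumptions for illustration, not the actual preparation scripts.

```python
import xarray as xr

# Hypothetical store paths and coordinate names, for illustration only.
era5 = xr.open_zarr("era5/raw.zarr")
era5_day = era5.sel(time="2020-03-01")  # 1-day temporal subset

biomass = xr.open_zarr("esa-biomass-cci/raw.zarr")
# Approximate bounding box over mainland France (assumes latitude is
# stored in descending order).
biomass_fr = biomass.sel(lat=slice(51.5, 41.0), lon=slice(-5.5, 10.0))
```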

Overall, this leads to the following sizes on disk:

 13M	./datasets/cmip6-access-ta/standardized.zarr
 60M	./datasets/cams-nitrogen-dioxide/standardized.zarr
 90M	./datasets/esa-biomass-cci/standardized.zarr
189M	./datasets/nextgems-icon/standardized.zarr
202M	./datasets/era5/standardized.zarr
2.8M	./datasets/cmip6-access-tos/standardized.zarr

This adds up to a total of 556.9 MB, so downloading everything should be very feasible. There are two more datasets I want to add:

  • GOES-16 data with OLR measurements
  • 1-day ERA5 snapshot during Hurricane Ida (2020-08-27) with the geopotential variable and wind speeds.

The open question, then, is whether this is sufficient data, but so far the consensus seems to have been that we don't need lots of data to get representative error estimates. Another thing to keep in mind is the time it takes to run the benchmark: running all the compressors on all input data currently takes around 1.5 hours on my laptop. This could easily be sped up through parallelization (a rough sketch follows below), and in practice I would expect most people to only run one compressor (the one they want to develop) on all data sources, so there is no need for everyone to re-evaluate all the baseline compressors all the time.
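
As a rough illustration of how such a run could be parallelized, here is a sketch that fans out the (compressor, dataset) pairs over worker processes; run_compressor and the two name lists are hypothetical placeholders rather than the actual benchmark API.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

# Hypothetical placeholders; the real benchmark entry point may differ.
COMPRESSORS = ["zfp", "sz3", "bitround"]
DATASETS = ["era5", "cmip6-access-ta", "esa-biomass-cci"]


def run_compressor(compressor: str, dataset: str) -> None:
    """Placeholder for compressing and evaluating one dataset with one compressor."""
    ...


if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(run_compressor, c, d)
            for c, d in product(COMPRESSORS, DATASETS)
        ]
        for future in futures:
            future.result()  # re-raise any errors from the workers
```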

@juntyr
Collaborator

juntyr commented May 29, 2025

Thanks for the PR @treigerm! I overall agree with the changes. Can you please address my two small nits:

  • Every subsetting should have a comment on why this subset was chosen, in particular which dimensions are included, which are subset, and why. This will be especially important for the hurricane; for the others, saying that the date has no significance or that France has diverse land cover would suffice.

  • We should be careful about chunking for all datasets, and I'd like to set the chunks explicitly everywhere. All datasets are small enough that we shouldn't need chunking, so explicitly setting each one to a single large chunk would be my preference (see the sketch after this list). This is especially important because compressors often have boundary artefacts, and avoiding chunk boundaries entirely saves us some trouble (we can add chunking back later with the larger datasets, where e.g. investigating consistency across time chunks would be interesting).
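
For reference, here is a minimal sketch of what forcing a single chunk per variable could look like when writing the standardized Zarr stores with xarray; the file paths and the chunking call are assumptions for illustration, not the project's actual preparation code.

```python
import xarray as xr

ds = xr.open_dataset("some_dataset.nc")  # hypothetical input file

# Collapse every dimension into one chunk so that each variable ends up
# as a single Zarr chunk, avoiding compressor boundary artefacts.
ds = ds.chunk({dim: -1 for dim in ds.dims})

ds.to_zarr("standardized.zarr", mode="w")
```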

@treigerm treigerm changed the title from "Prepare data sources" to "Prepare data sources for paper" on May 30, 2025
@treigerm
Member Author

Thanks @juntyr! I've added the comments and forced all datasets to have only one chunk. I agree that this seems reasonable, as none of the datasets is larger than 200 MB.

Also, does our compressor library automatically use the Zarr chunks when doing the compression?

@juntyr
Collaborator

juntyr commented May 31, 2025

In https://github.com/ClimateBenchPress/compressor/blob/5b3ee73202de2ce257824171bc17afd8da34b767/src/climatebenchpress/compressor/scripts/compress.py#L141, we're using encode_decode_data_array, which does operate on chunks... Perhaps I should change the name and add extra docs to make that more explicit (maybe encode_decode_chunked_data_array?).

@treigerm
Member Author

treigerm commented Jun 2, 2025

I think changing the name might make it too verbose, but maybe adding a comment would be good to highlight that it operates on chunks. Also, I can't find where the method is actually implemented or which base class it belongs to. In the script it's typed as a numcodecs.abc.Codec, but that abstract base class itself doesn't have an encode_decode_data_array method (https://numcodecs.readthedocs.io/en/stable/abc.html#numcodecs.abc.Codec). Is the method added somewhere in numcodecs-rs?

@juntyr
Collaborator

juntyr commented Jun 2, 2025

It comes from the CodecStack in which we wrap the codec here: https://github.com/ClimateBenchPress/compressor/blob/5b3ee73202de2ce257824171bc17afd8da34b767/src/climatebenchpress/compressor/scripts/compress.py#L130-L131. For most single codecs you can just use codec.decode(codec.encode(data)), but there are edge cases, and it gets especially tricky once you chain several codecs. The stack takes care of all of that if you want to do this roundtrip.
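
As a concrete illustration of the single-codec roundtrip mentioned above, here is a minimal sketch using numcodecs; the choice of codec and the manual dtype/shape restoration are assumptions for illustration, not how CodecStack handles it internally.

```python
import numpy as np
from numcodecs import Zstd

data = np.random.default_rng(0).normal(size=(96, 192)).astype("float32")

# Naive roundtrip for a single lossless codec: encode to bytes, decode,
# then restore dtype and shape by hand.
codec = Zstd(level=3)
decoded = codec.decode(codec.encode(data))
roundtripped = np.frombuffer(decoded, dtype=data.dtype).reshape(data.shape)

assert np.array_equal(roundtripped, data)
```

Once several codecs are chained, the intermediate dtypes, shapes, and buffer types can change between stages, which is exactly the bookkeeping the CodecStack wrapper takes care of.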

@treigerm
Member Author

treigerm commented Jun 2, 2025

Thanks!

@treigerm
Member Author

treigerm commented Jun 2, 2025

Is there anything blocking this PR from being merged, @juntyr? I think we can add the comments in the compress.py script in a later PR to the compressor repo.

Collaborator

@juntyr juntyr left a comment


LGTM, feel free to merge!

@treigerm treigerm merged commit 071796d into main Jun 2, 2025
4 checks passed
@treigerm treigerm deleted the fix_data branch July 31, 2025 08:41