
Conversation

@treigerm
Member

Prepare the data now so that datasets without the "-tiny" suffix are the full datasets for our benchmark. The "-tiny" datasets can then be used mainly for debugging.

Here's a high-level list of what is included:

  • CAMS: 1 day (2023-06-15), all pressure levels
  • CMIP6 temperature: 1 year (2020), all pressure levels
  • CMIP6 SST: 1 year (2020)
  • ERA5: 1 day (2020-03-01)
  • Biomass: 1 year (2020, corresponding to a single time step), spatially restricted to a bounding box over mainland France (chosen based on a discussion with @milankl, who said that France contains diverse land cover; see the sketch after this list); the full Biomass snapshot is around 20 GB.
  • NextGEMS: 1 day (2020-03-01)
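
To make the subsetting above concrete, here is a minimal sketch of selecting a single day and a bounding box with xarray; the store paths, coordinate names, and the exact bounding box over France are assumptions for illustration, not the actual preparation scripts.

```python
import xarray as xr

# Hypothetical store paths and coordinate names, for illustration only.
era5 = xr.open_zarr("era5/raw.zarr")
era5_day = era5.sel(time="2020-03-01")  # 1-day temporal subset

biomass = xr.open_zarr("esa-biomass-cci/raw.zarr")
# Approximate bounding box over mainland France (assumes latitude is
# stored in descending order).
biomass_fr = biomass.sel(lat=slice(51.5, 41.0), lon=slice(-5.5, 10.0))
```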

Overall, this leads to the following sizes on disk:

 13M	./datasets/cmip6-access-ta/standardized.zarr
 60M	./datasets/cams-nitrogen-dioxide/standardized.zarr
 90M	./datasets/esa-biomass-cci/standardized.zarr
189M	./datasets/nextgems-icon/standardized.zarr
202M	./datasets/era5/standardized.zarr
2.8M	./datasets/cmip6-access-tos/standardized.zarr

This adds up to a total of 556.9 MB, so downloading everything should be very feasible. There are two more datasets I want to add:

  • GOES-16 data with OLR measurements
  • 1-day ERA5 snapshot during Hurricane Ida (2020-08-27) with the geopotential variable and wind speeds.

The open question, then, is whether this is sufficient data, but so far the consensus seems to have been that we don't need lots of data to get representative error estimates. Another thing to keep in mind is the time it takes to run the benchmark: running all the compressors on all input data currently takes around 1.5 hours on my laptop. This could easily be sped up through parallelization (a rough sketch follows below), and in practice I would expect most people to only run one compressor (the one they want to develop) on all data sources, so there is no need for everyone to re-evaluate all the baseline compressors all the time.
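
As a rough illustration of how such a run could be parallelized, here is a sketch that fans out the (compressor, dataset) pairs over worker processes; run_compressor and the two name lists are hypothetical placeholders rather than the actual benchmark API.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

# Hypothetical placeholders; the real benchmark entry point may differ.
COMPRESSORS = ["zfp", "sz3", "bitround"]
DATASETS = ["era5", "cmip6-access-ta", "esa-biomass-cci"]


def run_compressor(compressor: str, dataset: str) -> None:
    """Placeholder for compressing and evaluating one dataset with one compressor."""
    ...


if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(run_compressor, c, d)
            for c, d in product(COMPRESSORS, DATASETS)
        ]
        for future in futures:
            future.result()  # re-raise any errors from the workers
```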

@juntyr
Collaborator

juntyr commented May 29, 2025

Thanks for the PR @treigerm! I overall agree with the changes. Can you please address my two small nits:

  • Every subsetting should have a comment on why this subset was chosen, in particular which dimensions are included, which are subset, and why. This will be especially important for the hurricane; for the others, saying that the date has no significance or that France has diverse land cover would suffice.

  • We should be careful about chunking for all datasets, and I'd like to set the chunks explicitly everywhere. All datasets are small enough that we shouldn't need chunking, so explicitly setting each one to a single large chunk would be my preference (see the sketch after this list). This is especially important because compressors often have boundary artefacts, and avoiding chunk boundaries entirely saves us some trouble (we can add chunking back later with the larger datasets, where e.g. investigating consistency across time chunks would be interesting).
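
For reference, here is a minimal sketch of what forcing a single chunk per variable could look like when writing the standardized Zarr stores with xarray; the file paths and the chunking call are assumptions for illustration, not the project's actual preparation code.

```python
import xarray as xr

ds = xr.open_dataset("some_dataset.nc")  # hypothetical input file

# Collapse every dimension into one chunk so that each variable ends up
# as a single Zarr chunk, avoiding compressor boundary artefacts.
ds = ds.chunk({dim: -1 for dim in ds.dims})

ds.to_zarr("standardized.zarr", mode="w")
```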

@treigerm treigerm changed the title from "Prepare data sources" to "Prepare data sources for paper" on May 30, 2025
@treigerm
Member Author

Thanks @juntyr! I've added the comments and forced all datasets to have only one chunk. I agree that this seems reasonable, as none of the datasets is larger than 200 MB.

Also, does our compressor library automatically use the Zarr chunks when doing the compression?

@juntyr
Collaborator

juntyr commented May 31, 2025

In https://github.com/ClimateBenchPress/compressor/blob/5b3ee73202de2ce257824171bc17afd8da34b767/src/climatebenchpress/compressor/scripts/compress.py#L141, we're using encode_decode_data_array, which does operate on chunks... Perhaps I should change the name and add extra docs to make that more explicit (maybe encode_decode_chunked_data_array?).

@treigerm
Member Author

treigerm commented Jun 2, 2025

I think changing the name might make it too verbose, but maybe adding a comment would be good to highlight that it operates on chunks. Also, I can't find where the method is actually implemented or which base class it belongs to. In the script it's typed as a numcodecs.abc.Codec, but that abstract base class itself doesn't have an encode_decode_data_array method (https://numcodecs.readthedocs.io/en/stable/abc.html#numcodecs.abc.Codec). Is the method added somewhere in numcodecs-rs?

@juntyr
Collaborator

juntyr commented Jun 2, 2025

It comes from the CodecStack in which we wrap the codec here: https://github.com/ClimateBenchPress/compressor/blob/5b3ee73202de2ce257824171bc17afd8da34b767/src/climatebenchpress/compressor/scripts/compress.py#L130-L131. For most single codecs you can just use codec.decode(codec.encode(data)), but there are edge cases, and it gets especially tricky once you chain several codecs. The stack takes care of all of that if you want to do this roundtrip.
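
As a concrete illustration of the single-codec roundtrip mentioned above, here is a minimal sketch using numcodecs; the choice of codec and the manual dtype/shape restoration are assumptions for illustration, not how CodecStack handles it internally.

```python
import numpy as np
from numcodecs import Zstd

data = np.random.default_rng(0).normal(size=(96, 192)).astype("float32")

# Naive roundtrip for a single lossless codec: encode to bytes, decode,
# then restore dtype and shape by hand.
codec = Zstd(level=3)
decoded = codec.decode(codec.encode(data))
roundtripped = np.frombuffer(decoded, dtype=data.dtype).reshape(data.shape)

assert np.array_equal(roundtripped, data)
```

Once several codecs are chained, the intermediate dtypes, shapes, and buffer types can change between stages, which is exactly the bookkeeping the CodecStack wrapper takes care of.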

@treigerm
Member Author

treigerm commented Jun 2, 2025

Thanks!

@treigerm
Member Author

treigerm commented Jun 2, 2025

Is there anything blocking this PR from being merged, @juntyr? I think we can add the comments in the compress.py script in a later PR to the compressor repo.

Collaborator

@juntyr juntyr left a comment


LGTM, feel free to merge!

@treigerm treigerm merged commit 071796d into main Jun 2, 2025
4 checks passed
@treigerm treigerm deleted the fix_data branch July 31, 2025 08:41