Prepare data sources for paper #22
Conversation
Thanks for the PR @treigerm! Overall, I agree with the changes. Can you please address my two small nits:
Thanks @juntyr! I've added comments and now forced all datasets to have only one chunk. I agree that this seems reasonable as none of the datasets are more than 200 MB. Also, does our compressor library automatically use the Zarr chunks when doing the compression?
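For illustration, here is a minimal sketch of forcing a single Zarr chunk with xarray/dask; the file names are placeholders and this is not necessarily how the actual data-prep scripts do it:

```python
import xarray as xr

# Hypothetical input file; the real prep scripts may read different sources.
ds = xr.open_dataset("dataset.nc")

# chunks=-1 along every dimension collapses each variable into one chunk,
# which is fine here since no dataset exceeds ~200 MB.
ds = ds.chunk({dim: -1 for dim in ds.dims})

# The single chunk carries over into the Zarr store on disk.
ds.to_zarr("dataset.zarr", mode="w")
```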
In https://github.com/ClimateBenchPress/compressor/blob/5b3ee73202de2ce257824171bc17afd8da34b767/src/climatebenchpress/compressor/scripts/compress.py#L141, we're using
I think changing the name might make it too verbose, but maybe adding a comment would be good to highlight that this operates on chunks. Also, I can't find where the method is actually implemented or what base class it belongs to. In the script it's typed as a
It comes from the
Thanks!
Is there anything blocking this PR from being merged, @juntyr? I think we can add the comments in the
juntyr left a comment
LGTM, feel free to merge!
Prepare the data now so that datasets without the "-tiny" suffix are the full datasets for our benchmark. The "-tiny" datasets can then be used mainly for debugging.
Here's a high-level list of what is included:
Overall, this leads to the following sizes on disk:
The total is 556.9 MB, so this should be very feasible to download. There are two more datasets I want to add:
The question then is whether this is sufficient data, but so far the consensus seems to have been that we don't need lots of data to get representative error estimates. Another thing to keep in mind is the time it takes to run the benchmark. Running all the compressors on all input data now takes around 1.5 hours on my laptop. This could easily be sped up through parallelization, though, and in practice I would expect that most people will only run one compressor (the one they want to develop) on all data sources; there's no need for everyone to re-evaluate all the baseline compressors all the time.
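To illustrate the parallelization point, something along these lines could fan out one compressor over all datasets with a process pool; `run_compressor` and the `data/*.zarr` layout are placeholders, not the actual benchmark scripts:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_compressor(dataset_path: Path) -> str:
    # Placeholder: compress and evaluate a single dataset with the one
    # compressor under development, then return its name for logging.
    ...
    return dataset_path.name

if __name__ == "__main__":
    datasets = sorted(Path("data").glob("*.zarr"))  # hypothetical layout
    # Datasets are independent, so the runs can proceed in separate processes.
    with ProcessPoolExecutor() as pool:
        for name in pool.map(run_compressor, datasets):
            print(f"finished {name}")
```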