Skip to content

Proposal for is_cloud_optimized#145

Merged
valeriupredoi merged 5 commits intoNCAS-CMS:mainfrom
zequihg50:is_cloud_optim
Nov 12, 2025
Merged

Proposal for is_cloud_optimized#145
valeriupredoi merged 5 commits intoNCAS-CMS:mainfrom
zequihg50:is_cloud_optim

Conversation

@zequihg50
Copy link
Contributor

The new API introduced in #138 allows for a mostly trivial implementation of is_cloud_optimized. This could serve to validate files that have been repacked. This is just a proposal, if this looks correct I can continue providing a test for it. We should maybe take into account other factors such as not chunked variables.

@codecov
Copy link

codecov bot commented Nov 5, 2025

Codecov Report

❌ Patch coverage is 81.48148% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.24%. Comparing base (bac7664) to head (c315877).
⚠️ Report is 199 commits behind head on main.

Files with missing lines Patch % Lines
pyfive/high_level.py 72.22% 0 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #145      +/-   ##
==========================================
+ Coverage   76.21%   76.24%   +0.03%     
==========================================
  Files          14       14              
  Lines        2867     2888      +21     
  Branches      450      459       +9     
==========================================
+ Hits         2185     2202      +17     
  Misses        561      561              
- Partials      121      125       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@zequihg50
Copy link
Contributor Author

In thinking some more about this I have realized the following:

  • h5d first_chunk - This relied on the first "logical chunk" (i.e. chunk with logical id 0) but that chunk may not be the first in the file. I have a file from CORDEX that illustrates this. This needs a fix for Optimise when we get access to b-tree by providing lazier view of datasets, access to b-tree location, and new p5dump #138. That last commit addresses this.
  • For a file to be "repacked" or "cloud optimized" it is not enough to check if for a given variable its first chunk is located after btree nodes. It needs to be checked that the min offset of all chunks of all variables is located after all btree nodes for all variables. That last commit addresses this.

@zequihg50
Copy link
Contributor Author

We should think of a better name than is_cloud_optimized. This functionality checks if the file has been cmip7repacked,which can indeed be considered cloud optimized. However, there are other mechanisms that could also qualify as cloud optimization, for example, repacking with pages. Therefore, I’m thinking is_metadata_repacked might be a clearer and more accurate name than is_cloud_optimized.

@davidhassell
Copy link
Collaborator

Ezequiel and I chatted about this today, and I agree that a is_metadata_repacked method is a good way to go, where is_metadata_repacked checks if all B-tree offsets are before all data chunks offsets.

@kmuehlbauer
Copy link
Collaborator

kmuehlbauer commented Nov 6, 2025

repacked sounds like the file has to be treated by some means. Can a file be created which does return True with this check, which hasn't been repacked? Just asking...

@zequihg50
Copy link
Contributor Author

zequihg50 commented Nov 6, 2025

Yes, it is theoretically possible for a file to meet the criteria of cloud optimized without being "repacked" or processed in any way, although I would say this does not happen much in practice, due to how files are written. I agree that the name for this requires some thought. In the meantime, I have added a test using the is_cloud_optimized nomenclature, @davidhassell would you review this?

@kmuehlbauer
Copy link
Collaborator

Thanks @zequihg50 for the details, much appreciated.

@davidhassell
Copy link
Collaborator

I agree that the "re" in "repacked" isn't ideal, as you could indeed get a nice file without post processing it.

I not too keen on "is_cloud_optimised", as a file with all of its internal metadata up front can still be terrible in that respect (if it has a large number of chunks).

I'll go and look at the PR, now

Copy link
Collaborator

@davidhassell davidhassell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cheers, @zequihg50

@davidhassell
Copy link
Collaborator

Hi - there are no objections (yet) to renaming the property consolidated_metadata, so @zequihg50, perhaps you could move to make that change (with the docstring) we can then get v1.0 out ...

Cheers,
David

Copy link
Collaborator

@davidhassell davidhassell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great.

@davidhassell
Copy link
Collaborator

davidhassell commented Nov 12, 2025

Hi @zequihg50 - we need a change log entry, too :) Thanks

@valeriupredoi
Copy link
Collaborator

valeriupredoi commented Nov 12, 2025

Hi @zequihg50 - we need a change log entry, too :) Thanks

I'll pop one in imminently, after I have merged it, and while I'll be in the process of releasing 1.0

@valeriupredoi
Copy link
Collaborator

thanks @zequihg50 and @davidhassell 🍻

@valeriupredoi valeriupredoi merged commit 13a95c3 into NCAS-CMS:main Nov 12, 2025
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants