Enable "lazy_tree" for all Datamodels #358

Merged: 9 commits, Jul 31, 2024

braingram
Collaborator

@braingram braingram commented Jul 14, 2024

Regtests all pass:
https://github.com/spacetelescope/RegressionTests/actions/runs/10163978364

This PR enables the asdf lazy_tree feature for roman_datamodels. By default asdf sets lazy_tree=False; this PR changes the default to True for calls to roman_datamodels.datamodels.open.

This PR also changes a few isinstance checks to include the "lazy" nodes returned by asdf (for example, instead of a dict, asdf will return an AsdfDictNode when lazy_tree=True).

Finally, this PR adds lazy=True to the roman datamodels asdf converters to signify that these converters can handle "lazy" objects.
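
As an illustration of the isinstance change described above, here is a minimal sketch using only the standard library. `LazyDictNode` is a hypothetical stand-in for a lazy dict-like node; it is not asdf's `AsdfDictNode` and makes no claim about how asdf implements its nodes. The point is only that a check written against `dict` alone rejects a dict-like wrapper unless the wrapper type is added to the check.

```python
from collections import UserDict


class LazyDictNode(UserDict):
    """Hypothetical stand-in for a lazy dict-like node (not asdf's class)."""


def is_dict_strict(value):
    # Original style of check: only real dict instances pass.
    return isinstance(value, dict)


def is_dict_or_lazy(value):
    # Widened check: accept plain dicts and the lazy wrapper type.
    return isinstance(value, (dict, LazyDictNode))


plain = {"a": 1}
lazy = LazyDictNode({"a": 1})

strict_results = (is_dict_strict(plain), is_dict_strict(lazy))
widened_results = (is_dict_or_lazy(plain), is_dict_or_lazy(lazy))
```

With the strict check the wrapper is rejected even though it behaves like a mapping; the widened check accepts both.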

Enabling "lazy_tree" allows asdf to defer conversion of custom objects (like astropy.units.Quantity) until they are accessed (from the containing object). This also allows asdf to defer loading the blocks containing the array data for Quantity (which is otherwise impossible because of how Quantity.__init__ handles its input). To provide an example, on roman_datamodels main:

>>> import roman_datamodels as rdm
>>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf")  # one of the regtest files
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
36

shows that 36 blocks were loaded from disk when the file was opened (the file contains a total of 41 blocks).
With this PR:

>>> import roman_datamodels as rdm
>>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf")  # one of the regtest files
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
2

Only 2 blocks were loaded (asdf does this as a sanity check of the file, loading the first and last block to confirm that the blocks are ordered appropriately).
Furthermore, with this PR, the above example takes 142.8M of RAM, whereas with main the example takes 486.5M.
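
To make the idea of deferred conversion concrete, here is a toy sketch in plain Python. It is not how asdf implements lazy_tree; `LazyTree` and `expensive_convert` are hypothetical names used only for illustration. The container stores raw values and runs the converter for a key only the first time that key is accessed, caching the result.

```python
class LazyTree:
    """Toy mapping that converts a raw value only on first access."""

    def __init__(self, raw, convert):
        self._raw = dict(raw)      # unconverted values, as read from the file
        self._convert = convert    # callable that builds the custom object
        self._converted = {}       # cache of already-converted values

    def __getitem__(self, key):
        if key not in self._converted:
            # Conversion (and any data loading it triggers) happens here.
            self._converted[key] = self._convert(self._raw[key])
        return self._converted[key]

    def converted_count(self):
        return len(self._converted)


calls = []


def expensive_convert(value):
    # Stand-in for a converter that would force array data to be read.
    calls.append(value)
    return [v * 2 for v in value]


tree = LazyTree({"data": [1, 2], "err": [3, 4]}, expensive_convert)
count_after_open = tree.converted_count()
first = tree["data"]
second = tree["data"]
count_after_access = tree.converted_count()
```

Nothing is converted when the tree is built; only the accessed key is converted, and a repeated access reuses the cached object.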


codecov bot commented Jul 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.64%. Comparing base (087a60d) to head (e413e49).
Report is 32 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #358      +/-   ##
==========================================
+ Coverage   97.56%   97.64%   +0.08%     
==========================================
  Files          30       36       +6     
  Lines        2788     3316     +528     
==========================================
+ Hits         2720     3238     +518     
- Misses         68       78      +10     

Collaborator

@PaulHuwe PaulHuwe left a comment

LGTM

Collaborator

@schlafly schlafly left a comment

Wow, I'm surprised that no changes were necessary to romancal proper.

Just out of curiosity, a "block" is an entire data extension, so for files with two blocks, lazy_tree doesn't do very much, even if they're long blocks? In my imagination a big win for lazy_tree is for operations that only need to read in the metadata (operations like [fits.getheader(x)[field] for x in list_of_filenames]), and I'm trying to understand if that's still the case with files with <= 2 data extensions.

@braingram
Collaborator Author

> Wow, I'm surprised that no changes were necessary to romancal proper.

There are some follow-up changes worth considering, related to removing the lazy_load usage in romancal:
https://github.com/search?q=repo%3Aspacetelescope%2Fromancal%20lazy_load&type=code
as disabling lazy_load largely defeats the IO benefits from lazy_tree (more on that below).

> Just out of curiosity, a "block" is an entire data extension, so for files with two blocks, lazy_tree doesn't do very much, even if they're long blocks? In my imagination a big win for lazy_tree is for operations that only need to read in the metadata (operations like [fits.getheader(x)[field] for x in list_of_filenames]), and I'm trying to understand if that's still the case with files with <= 2 data extensions.

Thanks for the question. I glossed over some details when referring to which blocks were "loaded". There is still a benefit for files with <= 2 data extensions, as the block.loaded property used above refers to the ASDF block header (not the block data).

To add some more (but hopefully not too much) detail: each ASDF block has a binary header and data. When a file is opened, asdf reads the headers for the first and last blocks but typically does not load the data (when lazy_load=True). However, when the data is accessed (or the converter forces the loading, as in the case of astropy.units.Quantity), asdf will read and cache the block data (for the block referred to by the array). For this PR, since we're enabling lazy_tree (and lazy_load is on by default), opening a roman file with the default arguments will:

>>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf")
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
2

Load the first and last block headers. However, the data portion of the blocks is not yet read (this example uses more of the private asdf API, as the public API for interfacing with the blocks is within the extension code and would complicate the example here).

>>> m._asdf._blocks._blocks[0]._cached_data
None

For this particular file, the 0th block corresponds to coefficients of a polynomial transform in the gwcs object. So if we access the wcs it will trigger loading the block data:

>>> m.meta.wcs
>>> m._asdf._blocks._blocks[0]._cached_data
array([...])

Finally, to circle back to the relationship with lazy_load: if a file is opened with lazy_load=False, every block header and block data will be loaded when the file is first opened. Using the above example:

>>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf", lazy_load=False)
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
41
>>> m._asdf._blocks._blocks[0]._data  # returns an array for every block in the file
array([...])

So for the most part using lazy_load=False cancels any benefits from lazy_tree. Let me know if more details would be helpful and if you have any questions.
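
For readers who want the header/data distinction spelled out, here is a toy model in plain Python. It only mimics the behavior described above and is not asdf's block implementation: `ToyBlock` and `open_blocks` are hypothetical, and the 41-block count is borrowed from the example file purely for illustration.

```python
class ToyBlock:
    """Toy block with a header and data that are loaded separately."""

    def __init__(self, payload):
        self._payload = payload    # stands in for the bytes on disk
        self.header = None         # filled in when the header is read
        self._cached_data = None   # filled in when the data is read

    @property
    def loaded(self):
        # "loaded" refers to the header only, not the data.
        return self.header is not None

    def load_header(self):
        self.header = {"size": len(self._payload)}

    @property
    def data(self):
        if not self.loaded:
            self.load_header()
        if self._cached_data is None:
            self._cached_data = list(self._payload)  # read and cache
        return self._cached_data


def open_blocks(payloads, lazy_load=True):
    blocks = [ToyBlock(p) for p in payloads]
    if not blocks:
        return blocks
    if lazy_load:
        # Sanity check: read only the first and last headers.
        blocks[0].load_header()
        blocks[-1].load_header()
    else:
        # Eager mode: read every header and every data section.
        for block in blocks:
            _ = block.data
    return blocks


payloads = [[i] * 3 for i in range(41)]
lazy_blocks = open_blocks(payloads)
eager_blocks = open_blocks(payloads, lazy_load=False)

lazy_loaded = sum(b.loaded for b in lazy_blocks)
eager_loaded = sum(b.loaded for b in eager_blocks)
lazy_cached_before = lazy_blocks[0]._cached_data
first_block_data = lazy_blocks[0].data
```

In the lazy case two headers are read and no data is cached until `data` is accessed; in the eager case every block is fully read at open time.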

@braingram braingram merged commit 911485a into spacetelescope:main Jul 31, 2024
16 checks passed
@braingram braingram deleted the lazy branch July 31, 2024 15:52
@schlafly
Collaborator

Ah, right, you had mentioned to me before that we had largely disabled lazy_load in the pipeline anyway; I agree that we can now try to do that better.

And yes, thank you for the clear description of what it means to load the first and last blocks. If I understand correctly, there's likely a small penalty related to doing the seek to get the last block header, but it's small and there's no associated memory usage penalty; that sounds like a reasonable trade.
