Enable "lazy_tree" for all Datamodels #358

braingram · 2024-07-14T15:58:15Z

Regtests all pass:
https://github.com/spacetelescope/RegressionTests/actions/runs/10163978364

This PR enables the asdf lazy_tree feature for roman_datamodels. By default asdf sets lazy_tree=False; this PR sets the default to True for calls to roman_datamodels.datamodels.open.

This PR also changes a few isinstance checks to include the "lazy" nodes returned by asdf (for example, instead of a dict asdf will return an AsdfDictNode when lazy_tree=True).
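
A minimal sketch of the kind of isinstance change described above. LazyDictNode is a hypothetical stand-in built on collections.UserDict, not asdf's AsdfDictNode, and is_mapping_node is a made-up helper name; the point is only that a dict-only check misses a wrapper type that behaves like a mapping.

```python
from collections import UserDict


class LazyDictNode(UserDict):
    """Hypothetical stand-in for a lazy mapping node (not asdf's class)."""


def is_mapping_node(value, lazy_types=(LazyDictNode,)):
    # Accept plain dicts as well as any of the lazy node types.
    return isinstance(value, (dict, *lazy_types))


plain = {"a": 1}
lazy = LazyDictNode({"a": 1})

print(isinstance(lazy, dict))  # a dict-only check rejects the wrapper
print(is_mapping_node(plain), is_mapping_node(lazy))
```

In real code the tuple passed to isinstance would name the actual lazy node classes provided by asdf.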

Finally, this PR adds lazy=True to the roman datamodels asdf converters to signify that these converters can handle "lazy" objects.

Enabling "lazy_tree" allows asdf to defer conversion of custom objects (like astropy.units.Quantity) until they are accessed (from the containing object). This also allows asdf to defer loading the blocks containing the array data for Quantity (this is otherwise impossible due to the handling of the input for Quantity.__init__). To provide an example, on roman_datamodels main:

>>> import roman_datamodels as rdm
>>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf")  # one of the regtest files
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
36

shows that 36 blocks were loaded from disk when the file was opened (the file contains a total of 41 blocks).
With this PR:

>>> import roman_datamodels as rdm
>>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf")  # one of the regtest files
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
2

Only 2 blocks were loaded (asdf does this as a sanity check of the file, loading the first and last block to confirm that the blocks are ordered appropriately).
Furthermore, with this PR, the above example takes 142.8M of RAM, whereas with main the example takes 486.5M.
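
The difference between the two runs above can be illustrated with a toy model that does not use asdf at all. Block, LazyTree, and eager_tree are hypothetical names invented for this sketch; it only shows that converting every value up front forces every block to be read, while converting on access reads just the blocks that are actually used.

```python
class Block:
    """Toy stand-in for a binary block; tracks whether it has been read."""

    def __init__(self, payload):
        self._payload = payload
        self.loaded = False

    def read(self):
        self.loaded = True
        return self._payload


class LazyTree:
    """Toy mapping that converts a value only when its key is accessed."""

    def __init__(self, raw, convert):
        self._raw = raw          # key -> Block, not yet converted
        self._convert = convert  # called on first access only
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            self._cache[key] = self._convert(self._raw[key].read())
        return self._cache[key]


def eager_tree(raw, convert):
    # Converting everything up front forces every block to be read.
    return {key: convert(block.read()) for key, block in raw.items()}


def make_blocks():
    return {name: Block([1.0, 2.0, 3.0]) for name in ("data", "err", "dq")}


eager_blocks = make_blocks()
eager_tree(eager_blocks, tuple)
print(sum(b.loaded for b in eager_blocks.values()))  # 3

lazy_blocks = make_blocks()
tree = LazyTree(lazy_blocks, tuple)
print(sum(b.loaded for b in lazy_blocks.values()))   # 0
tree["data"]
print(sum(b.loaded for b in lazy_blocks.values()))   # 1
```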

Checklist

Added entry in CHANGES.rst under the corresponding subsection
updated relevant tests
updated relevant documentation
Passed romancal regression testing on Jenkins / PLWishMaster. Link: https://plwishmaster.stsci.edu:8081/job/RT/job/romancal/XXX/

codecov · 2024-07-14T16:01:00Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.64%. Comparing base (087a60d) to head (e413e49).
Report is 32 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #358      +/-   ##
==========================================
+ Coverage   97.56%   97.64%   +0.08%     
==========================================
  Files          30       36       +6     
  Lines        2788     3316     +528     
==========================================
+ Hits         2720     3238     +518     
- Misses         68       78      +10


PaulHuwe

LGTM


schlafly

Wow, I'm surprised that no changes were necessary to romancal proper.

Just out of curiosity, a "block" is an entire data extension, so for files with two blocks, lazy_tree doesn't do very much, even if they're long blocks? In my imagination a big win for lazy_tree is for operations that only need to read in the metadata (operations like [fits.getheader(x)[field] for x in list_of_filenames]), and I'm trying to understand if that's still the case with files with <= 2 data extensions.

braingram · 2024-07-31T15:51:27Z

> Wow, I'm surprised that no changes were necessary to romancal proper.

There are some follow-up changes that are worth considering related to removing lazy_load usage:
https://github.com/search?q=repo%3Aspacetelescope%2Fromancal%20lazy_load&type=code
as that largely defeats the IO benefits from lazy_tree (more on that below).

> Just out of curiosity, a "block" is an entire data extension, so for files with two blocks, lazy_tree doesn't do very much, even if they're long blocks? In my imagination a big win for lazy_tree is for operations that only need to read in the metadata (operations like [fits.getheader(x)[field] for x in list_of_filenames]), and I'm trying to understand if that's still the case with files with <= 2 data extensions.

Thanks for the question. I glossed over some details when referring to which blocks were "loaded". There is still a benefit for files with <= 2 data extensions, as the block.loaded property used above refers to the ASDF block header (not the block data).

To add some more (but hopefully not too much) detail: each ASDF block has a binary header and data. When a file is opened, asdf reads the headers for the first and last blocks but typically does not load the data (when lazy_load=True). However, when the data is accessed (or the converter forces the loading, as in the case of astropy.units.Quantity), asdf will read and cache the block data (for the block referred to by the array). Since this PR enables lazy_tree (and lazy_load is on by default), opening a roman file with the default arguments will load only the first and last block headers:

>>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf")
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
2

The data portions of the blocks are not yet read (checking this uses more private asdf API, as the public API for interfacing with the blocks is within the extension code and would complicate the example here):

>>> m._asdf._blocks._blocks[0]._cached_data
None

For this particular file, the 0th block corresponds to coefficients of a polynomial transform in the gwcs object. So if we access the wcs it will trigger loading the block data:

>>> m.meta.wcs
>>> m._asdf._blocks._blocks[0]._cached_data
array([...])

Finally, to circle back to the relationship with lazy_load: if a file is opened with lazy_load=False, every block header and all block data will be loaded when the file is first opened. Using the above example:

>>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf", lazy_load=False)
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
41
>>> m._asdf._blocks._blocks[0]._data  # returns an array for every block in the file
array([...])

So for the most part using lazy_load=False cancels any benefits from lazy_tree. Let me know if more details would be helpful or if you have any questions.
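
To make the header-versus-data distinction concrete, here is a toy model of the states described in this comment. ToyBlock and open_toy are hypothetical names, and this is not how asdf is implemented; the sketch only mirrors the loaded flag (header read) and the cached data (data read), and how an eager open fills in both for every block.

```python
class ToyBlock:
    """Toy block with a header flag and separately cached data."""

    def __init__(self, source):
        self._source = source      # stands in for bytes on disk
        self.loaded = False        # True once the header has been read
        self._cached_data = None   # filled in once the data has been read

    def read_header(self):
        self.loaded = True

    @property
    def data(self):
        self.read_header()
        if self._cached_data is None:
            self._cached_data = list(self._source)
        return self._cached_data


def open_toy(blocks, lazy_load=True):
    if lazy_load:
        # Only the first and last headers are checked; no data is read.
        for block in (blocks[0], blocks[-1]):
            block.read_header()
    else:
        # Eager open: every header and every data payload is read.
        for block in blocks:
            _ = block.data
    return blocks


lazy = open_toy([ToyBlock(range(4)) for _ in range(5)])
eager = open_toy([ToyBlock(range(4)) for _ in range(5)], lazy_load=False)

print(sum(b.loaded for b in lazy), lazy[0]._cached_data)  # 2 None
print(sum(b.loaded for b in eager))                       # 5
print(sum(b._cached_data is not None for b in eager))     # 5
```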

schlafly · 2024-07-31T16:20:40Z

Ah, right, you had mentioned to me before that we had largely disabled lazy_load in the pipeline anyway; I agree that we can now try to do that better.

And yes, thank you for the clear description of what it means to load the first and last blocks. If I understand correctly, there's likely a small penalty related to doing the seek to get the last block header, but it's small and there's no associated memory usage penalty; that sounds like a reasonable trade.
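
A rough illustration of why that seek is cheap, using an in-memory toy file. The 8-byte header layout and the offsets list are assumptions made up for this sketch (they are not ASDF's on-disk format); the point is that reading two headers touches only a few bytes no matter how large the data payloads are.

```python
import io
import struct

HEADER = struct.Struct(">Q")  # toy header: data length only (not ASDF's layout)


def write_toy_file(payloads):
    """Write header+data pairs and record where each header starts."""
    buf = io.BytesIO()
    offsets = []
    for payload in payloads:
        offsets.append(buf.tell())
        buf.write(HEADER.pack(len(payload)))
        buf.write(payload)
    buf.seek(0)
    return buf, offsets


def read_header(fd, offset):
    fd.seek(offset)  # jump straight to the header, skipping any data
    (size,) = HEADER.unpack(fd.read(HEADER.size))
    return size


fd, offsets = write_toy_file([b"a" * 1000, b"b" * 2000, b"c" * 3000])
first = read_header(fd, offsets[0])
last = read_header(fd, offsets[-1])

# Two headers read: 16 bytes in total, regardless of payload sizes.
print(first, last, 2 * HEADER.size)  # 1000 3000 16
```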

braingram mentioned this pull request Jul 16, 2024

Replace ModelContainer with ModelLibrary spacetelescope/romancal#1241

Merged

16 tasks

braingram force-pushed the lazy branch from 7434eed to 7fd58e6 Compare July 30, 2024 14:03

braingram marked this pull request as ready for review July 30, 2024 14:39

braingram requested a review from a team as a code owner July 30, 2024 14:39

braingram requested a review from schlafly July 30, 2024 14:40

PaulHuwe approved these changes Jul 31, 2024


braingram and others added 9 commits July 31, 2024 10:45

allow lazy nodes

7da1fa6

TST: enable lazy_tree as default

6be8dc2

convert to dict on serialization

cc8b2e7

allow lazy nodes during flat_dict

cc5b4b6

fix append_individual_image_metadata for lazy nodes

414c93d

update asdf pin

37eddf2

[pre-commit.ci] auto fixes from pre-commit.com hooks

9553b01

for more information, see https://pre-commit.ci

update deps for oldestdeps

d14c4c2

add changelog

e413e49

braingram force-pushed the lazy branch from 402f9ef to e413e49 Compare July 31, 2024 14:45

schlafly approved these changes Jul 31, 2024


braingram merged commit 911485a into spacetelescope:main Jul 31, 2024
16 checks passed

braingram deleted the lazy branch July 31, 2024 15:52

braingram mentioned this pull request Jul 31, 2024

Investigate need for lazy_loading asdf files spacetelescope/romancal#1341

Closed

braingram mentioned this pull request Jul 31, 2024

increase asdf minimum required version spacetelescope/romancal#1343

Merged

6 tasks
