
Unnecessary duplicate arrays when using asdf embedded in FITS files #1232

Closed
perrygreenfield opened this issue Nov 15, 2022 · 10 comments · Fixed by #1234
@perrygreenfield
Contributor

JP-2900 reports a failure of calwebb_spec3 when it tries to create source-based products containing slit cutouts from multiple exposures. In one case the asdf extension grew to larger than 4 GB, exceeding the size that astropy.io.fits can handle for a binary table entry. This revealed two problems. First, the pipeline was generating cutouts that made no sense (with vertical sizes of hundreds of pixels), indicating a problem with the segmentation algorithm and the lack of a sensible filter to reject cases where sources are badly confused; this led to trying to stuff many large cutouts into one file. Second, and the one that is actually an ASDF problem, the same cutout array appeared both in a FITS HDU and in the ASDF content contained in the ASDF HDU, meaning that the data was unnecessarily duplicated in the file.

@braingram
Contributor

Is there a minimal case for the asdf portion of this issue?

I'm looking at the jwst.datamodels save code, and it appears to use the asdf validation step to create the HDU list based on the model schema. That path involves the MultiExposureModel, which includes some custom schema modification code.

Unfortunately, if I generate a MultiExposureModel, add an image as an exposure, and save it, I am not seeing duplication of the data.

@perrygreenfield
Contributor Author

Read the custom schema modification code and tell me what you think it actually does.

What was the actual type of that image?

@braingram
Contributor

braingram commented Nov 15, 2022

Here's the test I ran:
https://gist.github.com/braingram/1e7505fd98f66fc6e0c5b8852bdda2eb

It looks like it's taking the core schema metadata, stripping all fits_keyword and fits_hdu key/value pairs, and then copying (overlaying) it into exposures.meta of the multiexposure schema. This roughly matches the first sentence of the preceding comment, but I don't follow the second sentence.

        # Create a new core.meta that will co-locate
        # with each exposure entry. This is done
        # by saving the meta information in a separate
        # FITS HDU.

Is the duplicated data under the meta key of the exposures or somewhere else in the file?
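
For illustration, the stripping step described above might look something like this sketch (strip_fits_keys is a hypothetical name, not the actual jwst helper):

def strip_fits_keys(schema):
    """Recursively drop FITS-specific keys (fits_keyword, fits_hdu)
    from a nested schema-like structure of dicts and lists."""
    if isinstance(schema, dict):
        return {
            key: strip_fits_keys(value)
            for key, value in schema.items()
            if key not in ("fits_keyword", "fits_hdu")
        }
    if isinstance(schema, list):
        return [strip_fits_keys(item) for item in schema]
    return schema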

@braingram
Contributor

Here's a minimal case without jwst

import asdf
import asdf.fits_embed
import astropy.io.fits
import numpy


arr = numpy.zeros((1000, 100), dtype='f4')
# make hdu list with array from tree
hdulist = astropy.io.fits.HDUList()
hdulist.append(astropy.io.fits.PrimaryHDU())
hdu = astropy.io.fits.ImageHDU(name='SCI')
# assign data after creation to avoid a copy
hdu.data = arr
hdulist.append(hdu)

# save tree and hdu list
aif = asdf.fits_embed.AsdfInFits(hdulist, {'arr': arr})
aif.write_to('test1.fits', overwrite=True)

# read in asdf in fits
aif2 = asdf.open('test1.fits', mode='r')
aif2.write_to('test2.fits', overwrite=True)

ff1 = astropy.io.fits.open('test1.fits')
ff2 = astropy.io.fits.open('test2.fits')
print(f"original size = {ff1[-1].size}")
print(f"resave size   = {ff2[-1].size}")

This produces the following output:

original size = 571
resave size   = 400616

Presumably we do not want to lose links between arrays in the tree and HDU items (is this the right term?). This could be made into a test to add to test_fits_embed.
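
A sketch of such a test (assuming pytest with its tmp_path fixture; the test name and size bound are illustrative):

import asdf
import asdf.fits_embed
import astropy.io.fits
import numpy


def test_resave_does_not_duplicate_arrays(tmp_path):
    arr = numpy.zeros((1000, 100), dtype='f4')
    hdulist = astropy.io.fits.HDUList([astropy.io.fits.PrimaryHDU()])
    hdu = astropy.io.fits.ImageHDU(name='SCI')
    hdu.data = arr  # assign after creation to avoid a copy
    hdulist.append(hdu)

    fn1 = str(tmp_path / 'test1.fits')
    fn2 = str(tmp_path / 'test2.fits')
    asdf.fits_embed.AsdfInFits(hdulist, {'arr': arr}).write_to(fn1)

    # round-trip: read the asdf-in-fits file and write it back out
    with asdf.open(fn1) as aif:
        aif.write_to(fn2)

    with astropy.io.fits.open(fn2) as ff2:
        # the ASDF HDU should stay small; if the link to the SCI HDU is
        # lost, it grows by roughly arr.nbytes (400000 bytes here)
        assert ff2[-1].size < arr.nbytes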

Per some emails with @perrygreenfield, this issue appears to be related to linking HDUs and arrays, and can be improved by changing find_or_create_block_for_array in EmbeddedBlockManager in fits_embed and util.get_array_base, neither of which handles NDArrayType.

diff --git a/asdf/fits_embed.py b/asdf/fits_embed.py
index b436281..093b416 100644
--- a/asdf/fits_embed.py
+++ b/asdf/fits_embed.py
@@ -92,7 +92,9 @@ class _EmbeddedBlockManager(block.BlockManager):
     def find_or_create_block_for_array(self, arr, ctx):
         from .tags.core import ndarray

-        if not isinstance(arr, ndarray.NDArrayType):
+        #import pdb; pdb.set_trace()
+        #if not isinstance(arr, ndarray.NDArrayType):
+        if True:
             base = util.get_array_base(arr)
             for hdu in self._hdulist:
                 if hdu.data is None:
diff --git a/asdf/util.py b/asdf/util.py
index a45c863..a615da4 100644
--- a/asdf/util.py
+++ b/asdf/util.py
@@ -85,8 +85,9 @@ def get_array_base(arr):
     For a given Numpy array, finds the base array that "owns" the
     actual data.
     """
+    from .tags.core import ndarray
     base = arr
-    while isinstance(base.base, np.ndarray):
+    while isinstance(base.base, (np.ndarray, ndarray.NDArrayType)):
         base = base.base
     return base

My original thoughts on using numpy.shares_memory instead of comparing bases aren't compatible with the test_array_view_different_layout test added in PR #930 to avoid writing invalid data to FITS. I haven't replicated this, but the diff above (which doesn't use shares_memory) doesn't cause the test to fail.
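
Read as a cleaned-up change rather than a debug hack, the util part of the diff amounts to this sketch (the fits_embed part would simply drop the isinstance skip so the HDU search always runs):

import numpy as np


def get_array_base(arr):
    """Find the base array that "owns" the data, walking through
    NDArrayType bases as well as plain ndarray bases."""
    from asdf.tags.core import ndarray

    base = arr
    while isinstance(base.base, (np.ndarray, ndarray.NDArrayType)):
        base = base.base
    return base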

@braingram
Contributor

Duplication can also occur if the data attribute of an HDU is set to an object that is not an ndarray. This results in a copy (for example, assigning an NDArrayType to hdu.data results in a copy), and the data will then be duplicated because the link between the item in the hdulist and the array in the tree is broken.
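
A small diagnostic sketch (not from the thread) for checking whether that link survived; numpy.shares_memory is a stock numpy function and returns False once a copy has been made:

import numpy as np


def hdu_shares_data(hdu, arr):
    """Return True if the HDU's data buffer and arr share memory.

    False means asdf cannot link them, so the array ends up written
    twice: once as the HDU and once as an ASDF-internal block.
    """
    if hdu.data is None:
        return False
    return np.shares_memory(hdu.data, np.asarray(arr))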

braingram added a commit to braingram/asdf that referenced this issue Nov 16, 2022
NDArrayType fails checks to isinstance(arr, ndarray) as it is not a
subclass of ndarray. This prevented util.get_array_base from finding
the correct base array which contributed to failures to link hdu
items to arrays within the tree (leading to duplication of data).

The EmbeddedBlockManager in fits_embed also skipped checking for
links between NDArrayType and hdu items which also broke the link
between hdu items and arrays within the tree. This PR removes the
skip.

fixes issue asdf-format#1232
@jdavies-st
Contributor

jdavies-st commented Nov 17, 2022

Here's another minimal case indicating that it seems (I think) to be coming from asdf, independent of anything in jwst.datamodels or stdatamodels:

from jwst import datamodels
from asdf.fits_embed import AsdfInFits
from astropy.io import fits


dm = datamodels.ImageModel((2048, 2048))
dm.save("test.fits")
with fits.open("test.fits") as hdulist:
    print(hdulist.info())

af = AsdfInFits.open("test.fits")
af.write_to("test2.fits")
with fits.open("test2.fits") as hdulist:
    print(hdulist.info())

and one gets as output

Filename: test.fits
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       7   ()      
  1  SCI           1 ImageHDU         9   (2048, 2048)   float32   
  2  ASDF          1 BinTableHDU     11   1R x 1C   [683B]   

Filename: test2.fits
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       7   ()      
  1  SCI           1 ImageHDU         9   (2048, 2048)   float32   
  2  ASDF          1 BinTableHDU     11   1R x 1C   [16777944B]   

Note that initializing the AsdfInFits object with a FITS file bypasses anything that jwst.datamodels and stdatamodels might be doing in terms of validation or shuffling. I think.

EDIT: oops, I realized that above there's already a minimal case with jwst. 😅

@braingram
Contributor

Thanks! Confirmation is always good :)

Here's another minimal case using stdatamodels.

import asdf
import astropy.io.fits
import stdatamodels
import numpy

# --- this bit here replicates test infrastructure already in stdatamodels ---
cfg = asdf.get_config()
path_to_stdatamodels_repo = FILL_THIS_IN_FOR_YOUR_CHECKOUT
schema_dir = f"{path_to_stdatamodels_repo}/tests/schemas"
cfg.add_resource_mapping(asdf.resource.DirectoryResourceMapping(
    schema_dir, "http://example.com/schemas"))

class FitsModel(stdatamodels.DataModel):
    schema_url = "http://example.com/schemas/fits_model"
# --- end of replication ---

arr = numpy.zeros((1000, 100), dtype='f4')
m = FitsModel(arr)
m.save('test1.fits')

m2 = FitsModel.from_fits('test1.fits')
m2.save('test2.fits')

ff1 = astropy.io.fits.open('test1.fits')
ff2 = astropy.io.fits.open('test2.fits')
print(f"original size = {ff1[-1].size}")
print(f"resave size   = {ff2[-1].size}")

I believe the fix will have to occur in asdf (discussion is underway on PR #1233), but adding these cases as tests to stdatamodels and perhaps also jwst would help to catch any regressions. I will open PRs with tests for jwst and stdatamodels (tests that currently fail and will be marked as skipped). Once we have a fix we can unskip these (or wait to accept the PRs until the fix is available).
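
The skip-until-fixed pattern mentioned here might look like this (the test body and reason string are illustrative):

import pytest


@pytest.mark.skip(reason="data duplication; see asdf-format/asdf#1232")
def test_no_data_duplication_on_resave(tmp_path):
    # resave a model and compare ASDF HDU sizes, as in the cases above
    ...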

braingram added a commit to braingram/stdatamodels that referenced this issue Nov 17, 2022
See asdf issue asdf-format/asdf#1232

This test is currently skipped as it will fail with the current
asdf version.
braingram added a commit to braingram/jwst that referenced this issue Nov 17, 2022
When resaving a model to fits, asdf loses track of array-hdu
links and saves arrays in both hdu and as internal blocks in the
asdf tree. This can make an asdf tree that is larger than what
can be stored in a fits hdu.

See asdf issue: asdf-format/asdf#1232

This test is skipped as it fails with the current asdf version.
@perrygreenfield
Contributor Author

I'm trying to think of a case where an inadvertent reference to an HDU array could happen. I'm asking whether any JWST products are generated with identical DQ or ERR arrays (e.g., with constant values, or identical defaults). That may affect whether this is an issue in the only case this machinery is likely to be used for.

@perrygreenfield
Contributor Author

Even if this leads to multiple ASDF attributes pointing to the same HDU (with other HDUs having the same array contents that really should have matched, and some of those others not actually being used), I wonder if that really is a problem. The point of the FITS files is to provide users with something they are comfortable with and know how to interpret; in most cases they would ignore the ASDF content. And code that does use the ASDF content shouldn't care where in the FITS file the array values came from. The main worry with the ASDF is if different components share the same array when they shouldn't. In most (and hopefully all) cases I think tasks make copies of the arrays, but I'm not completely sure of that. Another question for Howard.

@perrygreenfield
Contributor Author

I asked Howard the following questions:

  1. We are looking more at this issue with duplicated arrays. Do products ever get generated with identical default DQ and/or ERR arrays within the same file? Or are they generally written out with populated results when such arrays appear?
  2. A related question: are all data read from FITS files copied before modification? (Some of the solutions being considered may end up with ASDF elements sharing the data from the same HDU when the file is read). Would that cause a problem?

And Howard's responses:

As to the first question, the DQ arrays get populated with something non-default in essentially the very first step of the pipeline and proceed with more updates from there. So I doubt there are ever any products where multiple DQ arrays would have all default (e.g., zero) values. ERR arrays don't get populated until the ramp_fit step, which is the end of Stage 1 processing (rate products). So it would only be intermediate products saved before that stage (e.g., our 4D "ramp" product) that might still have all-default ERR array values. But each product would only have one ERR array. It isn't until mid-way through Stage 2 processing that we ever get products containing multiple DQ or ERR arrays, which occurs for spectroscopic modes that have multiple sources split out from the original single image. But by that time the DQ and ERR arrays should never still be at default values.

Most of the time we have steps make a copy of the input data before making changes. Originally we had things make corrections in-place, just to save memory (i.e. no need to have 2 copies of the data in memory all the time). That's fine when running a default pipeline end-to-end from the command line, because you get the correct final results in the output files, but when running interactively from within python, it's not always desirable. Users may not want to have their inputs tampered with, especially if they want to run a step multiple times to tweak params or whatever.

braingram added a commit to braingram/jwst that referenced this issue Nov 18, 2022
braingram added a commit to braingram/jwst that referenced this issue Nov 22, 2022
braingram added a commit to braingram/asdf that referenced this issue Nov 22, 2022
braingram added a commit to braingram/jwst that referenced this issue Nov 29, 2022
nden pushed a commit to spacetelescope/jwst that referenced this issue Nov 29, 2022
* Add test for duplicate data when resaving model to fits

* bump asdf version to 2.14.1 to fix hdu data

unskip test for issue #7354

fixes #7354
nden pushed a commit to spacetelescope/stdatamodels that referenced this issue Nov 29, 2022
* Add test for asdf data duplication issue with fits

* bump asdf to 2.14.1 to fix hdu data duplication

* update changelog