Unnecessary duplicate arrays when using asdf embedded in FITS files #1232
Is there a minimal case for the asdf portion of this issue? I'm looking at the jwst.datamodels save code, and it appears to use the asdf validation step to create the HDU list based on the model schema. That code mentions the MultiExposureModel, which includes some custom schema modification code. Unfortunately, if I generate a MultiExposureModel, add an Image as an exposure, and save it, I am not seeing any duplication of the data.
Read the custom schema modification code and tell me what you think it actually does. What was the actual type of that image?
Here's the test I ran:

It looks like it's taking the core schema metadata and stripping all
Is the duplicated data under the meta key of the exposures, or somewhere else in the file?
Here's a minimal case without jwst:

```python
import asdf
import asdf.fits_embed
import astropy.io.fits
import numpy

arr = numpy.zeros((1000, 100), dtype='f4')

# make hdu list with array from tree
hdulist = astropy.io.fits.HDUList()
hdulist.append(astropy.io.fits.PrimaryHDU())
hdu = astropy.io.fits.ImageHDU(name='SCI')
# assign data after creation to avoid a copy
hdu.data = arr
hdulist.append(hdu)

# save tree and hdu list
aif = asdf.fits_embed.AsdfInFits(hdulist, {'arr': arr})
aif.write_to('test1.fits', overwrite=True)

# read in the asdf-in-fits file and resave it
aif2 = asdf.open('test1.fits', mode='r')
aif2.write_to('test2.fits', overwrite=True)

# compare the size of the last (ASDF) HDU in each file
ff1 = astropy.io.fits.open('test1.fits')
ff2 = astropy.io.fits.open('test2.fits')
print(f"original size = {ff1[-1].size}")
print(f"resave size = {ff2[-1].size}")
```

This produces the following output:
Presumably we do not want to lose links between arrays in the tree and HDU items (is this the right term?). This could be made into a test to add to test_fits_embed.

Per some emails with @perrygreenfield, this issue appears to be related to linking HDUs and arrays, and can be improved by changing `find_or_create_block_for_array` in `_EmbeddedBlockManager` in fits_embed and `util.get_array_base`, neither of which handles NDArrayType:

```diff
diff --git a/asdf/fits_embed.py b/asdf/fits_embed.py
index b436281..093b416 100644
--- a/asdf/fits_embed.py
+++ b/asdf/fits_embed.py
@@ -92,7 +92,9 @@ class _EmbeddedBlockManager(block.BlockManager):
     def find_or_create_block_for_array(self, arr, ctx):
         from .tags.core import ndarray
 
-        if not isinstance(arr, ndarray.NDArrayType):
+        #import pdb; pdb.set_trace()
+        #if not isinstance(arr, ndarray.NDArrayType):
+        if True:
             base = util.get_array_base(arr)
             for hdu in self._hdulist:
                 if hdu.data is None:
diff --git a/asdf/util.py b/asdf/util.py
index a45c863..a615da4 100644
--- a/asdf/util.py
+++ b/asdf/util.py
@@ -85,8 +85,9 @@ def get_array_base(arr):
     For a given Numpy array, finds the base array that "owns" the
     actual data.
     """
+    from .tags.core import ndarray
     base = arr
-    while isinstance(base.base, np.ndarray):
+    while isinstance(base.base, (np.ndarray, ndarray.NDArrayType)):
         base = base.base
     return base
 
```

My original thoughts on using `numpy.shares_memory` instead of comparing base arrays aren't compatible with the `test_array_view_different_layout` test added in PR #930 to avoid writing invalid data in FITS. I haven't replicated this, but the diff above (which doesn't use shares_memory) doesn't cause that test to fail.
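As an illustration (my sketch, not from the thread) of why a bare `shares_memory` check would conflict with that layout test: a transposed view shares memory with its base but has a different memory layout, so its bytes cannot stand in for the owning HDU's data, while walking `.base` still identifies the owning array.

```python
import numpy

base = numpy.zeros((4, 4), dtype='f4')
view = base.T  # transposed view: same memory, different layout

# shares_memory reports overlap...
print(numpy.shares_memory(base, view))  # True
# ...but the view is not C-contiguous, so it cannot be written
# verbatim as the bytes of the owning FITS HDU
print(view.flags['C_CONTIGUOUS'])       # False
# walking .base still finds the array that owns the data
print(view.base is base)                # True
```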
Duplication can also occur if the data attribute of an HDU item is set to a non-ndarray object. This results in a copy (for example, assigning an NDArrayType to hdu.data results in a copy), and the data will then be duplicated because the link between the item in the hdulist and the array in the tree is broken.
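A small sketch (my illustration, not from the thread) of how to check whether an assignment preserved that link: with a plain ndarray the data is stored by reference, so `numpy.shares_memory` reports True; an assignment that forced a copy would report False.

```python
import astropy.io.fits
import numpy

arr = numpy.zeros((100, 100), dtype='f4')
hdu = astropy.io.fits.ImageHDU(name='SCI')
hdu.data = arr  # plain ndarray: stored by reference, no copy

# True here; an assignment that coerced a wrapper object (such as
# NDArrayType) into a fresh ndarray would make this False,
# breaking the hdu-to-tree link
print(numpy.shares_memory(hdu.data, arr))
```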
NDArrayType fails isinstance(arr, ndarray) checks because it is not a subclass of ndarray. This prevented util.get_array_base from finding the correct base array, which contributed to failures to link HDU items to arrays within the tree (leading to duplication of data). The _EmbeddedBlockManager in fits_embed also skipped checking for links between NDArrayType and HDU items, which likewise broke the link between HDU items and arrays within the tree. This PR removes the skip. Fixes issue asdf-format#1232.
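This is easy to confirm (my sketch, assuming the asdf 2.x layout in which NDArrayType lives in asdf.tags.core.ndarray, which is where the diff above imports it from):

```python
import numpy
from asdf.tags.core import ndarray

# NDArrayType lazily wraps an ASDF block; it is array-like but not
# an ndarray subclass, so isinstance/issubclass checks miss it
print(issubclass(ndarray.NDArrayType, numpy.ndarray))  # False
```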
Here's another minimal case indicating where it seems (I think) to be coming from:

```python
from jwst import datamodels
from asdf.fits_embed import AsdfInFits
from astropy.io import fits

dm = datamodels.ImageModel((2048, 2048))
dm.save("test.fits")

with fits.open("test.fits") as hdulist:
    print(hdulist.info())

af = AsdfInFits.open("test.fits")
af.write_to("test2.fits")

with fits.open("test2.fits") as hdulist:
    print(hdulist.info())
```

and one gets as output:
Note that initializing the

EDIT: oops, I realized that above there's already a minimal case with
Thanks! Confirmation is always good :) Here's another minimal case using stdatamodels:

```python
import asdf
import astropy.io.fits
import stdatamodels
import numpy

# --- this bit here replicates test infrastructure already in stdatamodels ---
cfg = asdf.get_config()
path_to_stdatamodels_repo = FILL_THIS_IN_FOR_YOUR_CHECKOUT
schema_dir = f"{path_to_stdatamodels_repo}/tests/schemas"
cfg.add_resource_mapping(asdf.resource.DirectoryResourceMapping(
    schema_dir, "http://example.com/schemas"))

class FitsModel(stdatamodels.DataModel):
    schema_url = "http://example.com/schemas/fits_model"
# --- end of replication ---

arr = numpy.zeros((1000, 100), dtype='f4')
m = FitsModel(arr)
m.save('test1.fits')
m2 = FitsModel.from_fits('test1.fits')
m2.save('test2.fits')

ff1 = astropy.io.fits.open('test1.fits')
ff2 = astropy.io.fits.open('test2.fits')
print(f"original size = {ff1[-1].size}")
print(f"resave size = {ff2[-1].size}")
```

I believe the fix will have to occur in asdf (with discussion commencing on PR #1233), but adding these cases as tests to stdatamodels and perhaps also jwst would help to catch any regressions. I will open PRs with tests for jwst and stdatamodels (tests that currently fail and will be marked as skip); a sketch of such a test follows below. Once we have a fix we can unskip these (or wait to accept the PRs until the fix is available).
See asdf issue asdf-format/asdf#1232. This test is currently skipped, as it will fail with the current asdf version.
When resaving a model to fits, asdf loses track of array-hdu links and saves arrays both in an HDU and as internal blocks in the asdf tree. This can make an asdf tree that is larger than what can be stored in a fits hdu. See asdf issue asdf-format/asdf#1232. This test is skipped as it fails with the current asdf version.
I'm trying to think of a case where an inadvertent reference to an HDU array would happen. I'm asking if any JWST products appear with identical DQ or ERR arrays (e.g., with constant values, or identical defaults). That may affect whether this is an issue in the only case this stuff is likely to be used for.
Even if this leads to multiple ASDF attributes pointing to the same HDU (with other HDUs having the same array contents that really should have matched), and some of those others not actually being used, I wonder if that really is a problem. The point of the FITS files is to provide users with something they are comfortable with and know how to interpret. In most cases they would ignore the ASDF content. And code that does use the ASDF content shouldn't care where the array values came from in the FITS file. The main worry with the ASDF is whether different components are sharing the same array when they shouldn't. In most (and hopefully all) cases I think tasks make copies of the arrays, but I'm not completely sure of that. Another question for Howard.
I asked Howard the following questions:
And Howard's responses:

As to the first question, the DQ arrays get populated with something non-default in the very first step of the pipeline and proceed with more updates from there. So I doubt there are ever any products where multiple DQ arrays would have all default (e.g. zero) values. ERR arrays don't get populated until the ramp_fit step, which is the end of Stage 1 processing (rate products). So it would only be intermediate products saved before that stage (e.g. our 4D "ramp" product) that might still have all-default ERR array values, and each such product would only have 1 ERR array. It isn't until mid-way through Stage 2 processing that we ever get products containing multiple DQ or ERR arrays, which occurs for spectroscopic modes that have multiple sources split out from the original single image. But by that time the DQ and ERR arrays should never still be at default values.

Most of the time we have steps make a copy of the input data before making changes. Originally we had things make corrections in place, just to save memory (i.e. no need to have 2 copies of the data in memory all the time). That's fine when running a default pipeline end-to-end from the command line, because you get the correct final results in the output files, but when running interactively from within python it's not always desirable. Users may not want to have their inputs tampered with, especially if they want to run a step multiple times to tweak params or whatever.
When resaving a model to fits, asdf loses track of array-hdu links and saves arrays both in an HDU and as internal blocks in the asdf tree. This can make an asdf tree that is larger than what can be stored in a fits hdu. See asdf issue asdf-format/asdf#1232. This test is skipped as it fails with the current asdf version.

modify imports and use context manager for test_fits
Co-authored-by: James Davies <jdavies@mpia.de>

use context manager in test_fits
Co-authored-by: James Davies <jdavies@mpia.de>

improve test docstring in test_fits
* Add test for duplicate data when resaving model to fits

  When resaving a model to fits, asdf loses track of array-hdu links and saves arrays both in an HDU and as internal blocks in the asdf tree. This can make an asdf tree that is larger than what can be stored in a fits hdu. See asdf issue asdf-format/asdf#1232. This test is skipped as it fails with the current asdf version. (Includes: modify imports and use context manager for test_fits; use context manager in test_fits; improve test docstring in test_fits. Co-authored-by: James Davies <jdavies@mpia.de>)

* bump asdf version to 2.14.1 to fix hdu data duplication
* unskip test for issue #7354

fixes #7354
* Add test for asdf data duplication issue with fits

  See asdf issue asdf-format/asdf#1232. This test is currently skipped as it will fail with the current asdf version.

* bump asdf to 2.14.1 to fix hdu data duplication
* update changelog
JP-2900 reports a failure of calwebb_spec3 when it tries to create source-based products containing slit cutouts from multiple exposures. In one case the asdf extension grew to larger than 4 GB, exceeding the size that astropy.io.fits can handle for a binary table entry. This revealed two problems. First, the pipeline was generating cutouts that made no sense (with vertical sizes of hundreds of pixels), illustrating a problem with the segmentation algorithm and the lack of a sensible filter to reject cases of considerable confusion of sources; this led to trying to stuff many large cutouts into one file. Second, and the one related to a problem with ASDF, the same cutout array appeared both in a FITS HDU and in the ASDF content contained in the ASDF HDU, meaning that data was unnecessarily duplicated in the file.
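Back-of-the-envelope arithmetic (my illustration, assuming float32 cutouts) shows how quickly such duplication can hit that limit:

```python
# each duplicated float32 array of shape (h, w) adds h * w * 4 bytes
# to the ASDF extension on top of the copy already stored in its HDU
h, w = 2048, 2048
per_array = h * w * 4
print(per_array / 2**20, "MiB per duplicated array")  # 16.0 MiB

# roughly this many full-frame duplicates exceed a 4 GiB table entry
print(2**32 // per_array)  # 256
```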