Decoding netCDF is giving incorrect values for a large file #5597

Closed
ohsqueezy opened this issue Jul 13, 2021 · 15 comments · Fixed by #8713

Comments

@ohsqueezy

ohsqueezy commented Jul 13, 2021

What happened:

A 0 value is decoded as 2.

What you expected to happen:

Data encoded as -32766 should decode to 0.
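
For reference, CF packing decodes stored values as packed * scale_factor + add_offset. A minimal spot check with this file's scale_factor and add_offset (shown in the encoding further down in this thread), done in plain Python floats (i.e. double precision):

# hypothetical spot check: the stored value -32766, with this file's
# scale_factor / add_offset, should map back to 0 in double precision
print(-32766 * 625.6492454183389 + 20500023.17537729)  # 0.0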

Minimal Complete Verifiable Example:

The first example is the base file I've been using, which is a 9GB packed netCDF. The first 12 values in this lookup should be 0 but are getting decoded as 2.

$ xarray.open_dataset("BIG_FILE_packed.nc").ssrd.isel(time=slice(0, 23)).sel(latitude=44.8, longitude=287.1, method="nearest").values
> array([2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00,
         2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00,
         2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00,
         2.565200e+04, 3.547440e+05, 1.091760e+06, 2.170378e+06,
         3.482364e+06, 4.704884e+06, 5.689655e+06, 6.297786e+06,
         6.534908e+06, 6.543667e+06, 6.543667e+06], dtype=float32)

This second example shows that if the file is opened without automatic mask_and_scale, the value decodes to 0 when the scale factor and add offset are applied manually to an example value in the interpreter.

$ xarray.open_dataset("BIG_FILE_packed.nc", mask_and_scale=False).ssrd.isel(time=slice(0, 23)).sel(\
                      latitude=44.8, longitude=287.1, method="nearest").values
> array([-32766, -32766, -32766, -32766, -32766, -32766, -32766, -32766,
         -32766, -32766, -32766, -32766, -32725, -32199, -31021, -29297,
         -27200, -25246, -23672, -22700, -22321, -22307, -22307], dtype=int16)

$ xarray.open_dataset("BIG_FILE_packed.nc", mask_and_scale=False).ssrd.isel(time=slice(0, 23)).sel(\
                      latitude=44.8, longitude=287.1, method="nearest").values[0] * \
  xarray.open_dataset("BIG_FILE_packed.nc").ssrd.encoding["scale_factor"] + xarray.open_dataset("BIG_FILE_packed.nc").ssrd.encoding["add_offset"]
> 0.0

When the netCDF is unpacked using the nco command line tool, the values are decoded correctly.

$ xarray.open_dataset("BIG_FILE_unpacked.nc").ssrd.isel(time=slice(0, 23)).sel(latitude=44.8, longitude=287.1, method="nearest").values
> array([      0.        ,       0.        ,       0.        ,
               0.        ,       0.        ,       0.        ,
               0.        ,       0.        ,       0.        ,
               0.        ,       0.        ,       0.        ,
           25651.61906215,  354743.1221522 , 1091757.933255  ,
         2170377.23235622, 3482363.69999847, 4704882.32554591,
         5689654.23783437, 6297785.304381  , 6534906.36839455,
         6543665.4578304 , 6543665.4578304 ])

Something else that may be relevant: another file containing this same packed data, but as a much smaller subset (1.7KB) of the big file, is unpacked correctly.

$ xarray.open_dataset("SMALL_FILE_packed.nc").ssrd.isel(time=slice(0, 23)).sel(latitude=44.8, longitude=287.1, method="nearest").values
> array([      0.  ,       0.  ,       0.  ,       0.  ,       0.  ,
               0.  ,       0.  ,       0.  ,       0.  ,       0.  ,
               0.  ,       0.  ,   25545.75,  354397.5 , 1091577.  ,
         2170077.  , 3482645.8 , 4704689.  , 5689927.  , 6297856.5 ,
         6535169.  , 6543583.  , 6543583.  ], dtype=float32)

For this to be a real verifiable example, I can transfer the 9GB file to someone or give instructions on how to download it from the climate API I'm getting it from! I'm not sure if this is an issue with xarray or the API or something I'm doing wrong. I've mostly been using an older version of xarray, but I also tested on the most recent version available on pip:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.6 (default, Jul 8 2021, 20:44:16)
[GCC 5.4.0 20160609]
python-bits: 64
OS: Linux
OS-release: 4.4.0-200-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.18.2
pandas: 1.3.0
numpy: 1.21.0
scipy: None
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.9.0
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 51.3.3
pip: 20.3.3
conda: None
pytest: None
IPython: 7.25.0
sphinx: None

@max-sixty
Collaborator

A small question to clarify:

When the netCDF is unpacked using the nco command line tool, the correct values are unpacked.

$ xarray.open_dataset("BIG_FILE_unpacked.nc").ssrd.isel(time=slice(0, 23)).sel(latitude=44.8, longitude=287.1, method="nearest").values
> array([      0.        ,       0.        ,       0.        ,
               0.        ,       0.        ,       0.        ,
               0.        ,       0.        ,       0.        ,
               0.        ,       0.        ,       0.        ,
           25651.61906215,  354743.1221522 , 1091757.933255  ,
         2170377.23235622, 3482363.69999847, 4704882.32554591,
         5689654.23783437, 6297785.304381  , 6534906.36839455,
         6543665.4578304 , 6543665.4578304 ])

Is that the output of the .open_dataset command? Is that the same code that generates the 2s? Or is the command that generates the zeros different?

@ohsqueezy
Author

That example is actually a different file than the original. I unpacked the original file externally using ncpdq -U BIG_FILE_packed.nc BIG_FILE_unpacked.nc before opening it with xarray, so the decoding step is skipped and there aren't any 2 values generated. The data is correct using that method, so it's a possible workaround, but unpacking externally makes each file 4x larger.

In all the examples, the data is from the same time and location, so the values should be the same apart from whatever is lost by compressing to int16 and decompressing. The output arrays come from selecting a single day (24 hours) at a single location from the dataset returned by open_dataset in the ipython interpreter.

So there are actually three files I've tested with, all of which should contain the same data (assuming the issue isn't with how the files are built, which could be the case): BIG_FILE_packed.nc, BIG_FILE_unpacked.nc, and SMALL_FILE_packed.nc. The only one that displays the issue is the first one.

@kmuehlbauer
Contributor

@ohsqueezy Does this issue also show up if just plain netCDF4 is used to open the files?

@max-sixty
Collaborator

Thanks. Does passing different values to engine= make any difference? I suspect the issue is at that layer.
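
For example, something along these lines (just a sketch using the file name from the report; some backends may refuse the file depending on its on-disk format):

# hypothetical quick comparison of the installed backends
import xarray

for eng in ["netcdf4", "scipy", "h5netcdf"]:
    ds = xarray.open_dataset("BIG_FILE_packed.nc", engine=eng)
    print(eng, ds.ssrd.sel(latitude=44.8, longitude=287.1, method="nearest").values[:12])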

@ohsqueezy
Author

Thanks for your help!

I checked using the netCDF4 module, and the data is returned correctly

$ d = netCDF4.Dataset("BIG_FILE_packed.nc")
$ d["ssrd"][d["time"][:] < d["time"][24], d["latitude"][:] == 44.8, d["longitude"][:] == 287.1]
> masked_array(
  data=[[[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[      0.        ]],
        [[  25651.61906215]],
        [[ 354743.1221522 ]],
        [[1091757.933255  ]],
        [[2170377.23235622]],
        [[3482363.69999847]],
        [[4704882.32554591]],
        [[5689654.23783437]],
        [[6297785.304381  ]],
        [[6534906.36839455]],
        [[6543665.4578304 ]],
        [[6543665.4578304 ]],
        [[6543665.4578304 ]]],
  mask=False,
  fill_value=1e+20)

I tried with scipy as the engine, and it still returns the 2 values

$ xarray.open_dataset("BIG_FILE_packed.nc", engine="scipy").ssrd.sel(latitude=44.8, longitude=287.1, method="nearest").values[:23]                                               
> array([2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00,
       2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00,
       2.000000e+00, 2.000000e+00, 2.000000e+00, 2.000000e+00,
       2.565200e+04, 3.547440e+05, 1.091760e+06, 2.170378e+06,
       3.482364e+06, 4.704884e+06, 5.689655e+06, 6.297786e+06,
       6.534908e+06, 6.543667e+06, 6.543667e+06], dtype=float32)

I should mention that in another large packed dataset from this API, I have gotten the same error but with a very small decimal value in place of the zero instead of 2.

@kmuehlbauer
Contributor

@ohsqueezy You might also try engine="h5netcdf" (h5py/h5netcdf packages needed). And would it be possible to create a small subset of that file via netCDF4 to share?
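
In case it helps, a minimal sketch of cutting such a subset with xarray itself (the file, variable and coordinate names are the ones from this thread; opening with mask_and_scale=False should keep the packed int16 values and their scale/offset attributes intact when writing back out):

import xarray

# open without unpacking so the packed int16 values and scale/offset attrs survive
ds = xarray.open_dataset("BIG_FILE_packed.nc", mask_and_scale=False)

# one day (24 hours) at the single grid point discussed above
subset = ds[["ssrd"]].isel(time=slice(0, 24)).sel(
    latitude=44.8, longitude=287.1, method="nearest"
)
subset.to_netcdf("packed_solar_data_subset.nc")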

@ohsqueezy
Author

h5netcdf seems to hit a separate issue for me, as it gives me the error

OSError: Unable to open file (file signature not found)

I looked into it once though, and I think I might be able to fix that. I'll also see if I can build a small netCDF that has reproducible behavior!

@keewis
Collaborator

keewis commented Jul 13, 2021

can you post the value of xarray.open_dataset("BIG_FILE_packed.nc").ssrd.encoding?

@ohsqueezy
Author

sure, no prob

$ xarray.open_dataset("BIG_FILE_packed.nc").ssrd.encoding           
{'source': 'BIG_FILE_packed.nc',
 'original_shape': (743, 1801, 3600),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 625.6492454183389,
 'add_offset': 20500023.17537729}

@shoyer
Member

shoyer commented Jul 13, 2021

This may just be the expected floating point error from using float32:

In [5]: import numpy as np

In [6]: -32766 * np.float32(625.6492454183389) + np.float32(20500023.17537729)
Out[6]: 1.2984619140625

If you use full float64, then the data does decode to 0.0:

In [7]: -32766 * np.float64(625.6492454183389) + np.float64(20500023.17537729)
Out[7]: 0.0

So the question then is why this ends up being decoded using float32 instead of float64, and if that logic should be adjusted or made customizable:

def _choose_float_dtype(dtype, has_offset):

@ohsqueezy
Author

That explains it to me! Not sure if it's still useful, but I exported the subset as a netCDF file. (In the snippets below, e is the ssrd encoding dict shown earlier.)

In [59]: packed_vals = xarray.open_dataset("packed_solar_data_subset.nc", mask_and_scale=False).ssrd.values                                      

In [60]: packed_vals[0] * numpy.float32(e["scale_factor"]) + numpy.float32(e["add_offset"])                                                            
Out[60]: 2.0

In [61]: packed_vals[0] * numpy.float64(e["scale_factor"]) + numpy.float64(e["add_offset"])                                                            
Out[61]: 0.0

Hm actually I think converting the packed vals to 64 bit and then decoding does what I'm looking for

In [62]: xarray.decode_cf(xarray.open_dataset("packed_solar_data_subset.nc", mask_and_scale=False).astype(numpy.float64)).ssrd.values            
Out[62]: 
array([      0.        ,       0.        ,       0.        ,
             0.        ,       0.        ,       0.        ,
             0.        ,       0.        ,       0.        ,
             0.        ,       0.        ,       0.        ,
         25651.61906215,  354743.1221522 , 1091757.933255  ,
       2170377.23235622, 3482363.69999847, 4704882.32554591,
       5689654.23783437, 6297785.304381  , 6534906.36839455,
       6543665.4578304 , 6543665.4578304 ])

@shoyer
Member

shoyer commented Jul 14, 2021

Thanks for sharing the subset netCDF file, that is very helpful for debugging indeed!

The weird thing is that the dtype picking logic seems to have a special case that, per the code comment, suggests we want to be using float64 here:

# float32 can exactly represent all integers up to 24 bits
if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
    # A scale factor is entirely safe (vanishing into the mantissa),
    # but a large integer offset could lead to loss of precision.
    # Sensitivity analysis can be tricky, so we just use a float64
    # if there's any offset at all - better unoptimised than wrong!
    if not has_offset:
        return np.float32

But in fact, the dtype picking logic doesn't do that, because the dtype has already been converted into float32 first. The culprit seems to be this line in CFMaskCoder, which promotes the dtype to float32 to fit a fill value of NaN:

dtype, decoded_fill_value = dtypes.maybe_promote(data.dtype)

To fix this, I think logic in _choose_float_dtype should be updated to look at encoding['dtype'] (if available) instead of dtype, in order to understand how the data was originally stored.
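
A rough sketch of that idea (a hypothetical helper only, not xarray's actual code):

import numpy as np

def _choose_unpack_dtype(encoding, data_dtype, has_offset):
    # prefer the on-disk dtype recorded in encoding over the in-memory dtype,
    # which CFMaskCoder may already have promoted to float32
    dtype = np.dtype(encoding.get("dtype", data_dtype))
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer) and not has_offset:
        # a small integer with only a scale factor fits exactly in float32
        return np.float32
    # anything with an add_offset (or a wider type) gets float64 to be safe
    return np.float64

For this file (int16 on disk, with an add_offset) that would give float64, and the -32766 values would decode to 0.0 as expected.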

@kmuehlbauer
Contributor

To fix this, I think logic in _choose_float_dtype should be updated to look at encoding['dtype'] (if available) instead of dtype, in order to understand how the data was originally stored.

This is what #7654 aims to address.

@kmuehlbauer
Contributor

New attempt here: #8713

@kmuehlbauer
Contributor

@ohsqueezy After #7654 went out of focus, I've resurrected the changes in #8713. Would you mind taking a look there, maybe testing it, and letting us know your thoughts?
