
Inline loaded variables into kerchunk references #73

Merged · TomNicholas merged 25 commits into main from inline_chunkdata on May 16, 2024
Conversation

@TomNicholas (Member) commented Apr 5, 2024

Would close case (1) of #62.

After #69 was merged, the next step (this PR) is to write loaded numpy arrays into the kerchunk references "inlined".
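
For illustration (values made up, not taken from this PR), a kerchunk references dict distinguishes regular chunk references, which point at a byte range in a file, from "inlined" chunks, whose (possibly base64-encoded) data is stored directly in the JSON:

refs = {
    # regular chunk reference: [url, offset, length]
    "air/0.0.0": ["air.nc", 15419, 7738000],
    # inlined chunk: the chunk's bytes themselves, base64-encoded if not valid UTF-8
    "lat/0": "base64:AAAgQgAAHEIAABhC",
}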

@TomNicholas added the enhancement and Kerchunk labels Apr 5, 2024
@TomNicholas (Member, Author)

The important part of this PR now works, but there's some annoying inconsistency in how kerchunk writes out NaN vs "NaN" vs null which is causing the test to fail 😠

@TomNicholas (Member, Author) commented Apr 6, 2024

For reference, the reason this test fails is that kerchunk.hdf.SingleHdf5ToZarr produces a references dict with these entries:

{
    'air/.zarray': '{"chunks":[2920,25,53],"compressor":null,"dtype":"<i2","fill_value":null,"filters":null,"order":"C","shape":[2920,25,53],"zarr_format":2}',
    ...
    'lat/.zarray': '{"chunks":[25],"compressor":null,"dtype":"<f4","fill_value":"NaN","filters":null,"order":"C","shape":[25],"zarr_format":2}',
...
}

Notice that for the variable air the default zarr fill_value of np.NaN is encoded as null, whereas for the variable lat it is encoded as "NaN". I can't see any reason for this discrepancy, or any pattern to it, so I can't really make my code consistent with inconsistent behaviour 🤷‍♂️. The fsspec mapper apparently accepts either.
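
For comparison, the Zarr v2 spec says a floating-point NaN fill value should be written as the string "NaN" (JSON itself has no NaN literal). A minimal sketch of normalising the value that way before serialising - encode_fill_value is a hypothetical helper, not code from this PR:

import json
import numpy as np

def encode_fill_value(fill_value):
    # Hypothetical helper: follow the Zarr v2 convention of writing a
    # floating-point NaN fill value as the string "NaN".
    if isinstance(fill_value, float) and np.isnan(fill_value):
        return "NaN"
    return fill_value

zarray = {
    "chunks": [25],
    "compressor": None,
    "dtype": "<f4",
    "fill_value": encode_fill_value(np.nan),
    "filters": None,
    "order": "C",
    "shape": [25],
    "zarr_format": 2,
}
print(json.dumps(zarray, separators=(",", ":")))
# {"chunks":[25],"compressor":null,"dtype":"<f4","fill_value":"NaN",...}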


codecov bot commented Apr 6, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 90.49%. Comparing base (f226093) to head (cd6ca47).
Report is 23 commits behind head on main.

❗ Current head cd6ca47 differs from pull request most recent head 0baee1a. Consider uploading reports for the commit 0baee1a to get more accurate results

Files Patch % Lines
virtualizarr/kerchunk.py 85.71% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
+ Coverage   90.18%   90.49%   +0.31%     
==========================================
  Files          14       15       +1     
  Lines         998     1021      +23     
==========================================
+ Hits          900      924      +24     
+ Misses         98       97       -1     
Flag Coverage Δ
unittests 90.49% <93.75%> (+0.31%) ⬆️

Flags with carried forward coverage won't be shown.

@TomNicholas (Member, Author)

Whilst this PR now passes for "inlining" the lat and lon variables (yay), it will fail if you attempt to load either the air variable or the time variable.

The air variable case is some json encoding error, which should be fixable with the right set of arguments to json.dumps.
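
A minimal sketch of the kind of fix that might work, assuming the failure comes from numpy scalar types (e.g. scale_factor / add_offset attributes) that json.dumps can't serialise natively - numpy_default here is a hypothetical helper, not this PR's code:

import json
import numpy as np

def numpy_default(obj):
    # Assumed helper: convert numpy scalars/arrays to builtin Python types
    # so json.dumps can serialise them.
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"{type(obj)} is not JSON serializable")

attrs = {"scale_factor": np.float32(0.01), "add_offset": np.int16(300)}
print(json.dumps(attrs, default=numpy_default))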

The time variable case is, I think, much harder - the encoded bytes look different to those that come out of kerchunk, presumably because kerchunk is doing some kind of CF-related encoding first. But looking inside kerchunk.hdf.SingleHdf5ToZarr I am none the wiser as to what it's doing that I would need to replicate.

Comment on lines 225 to 226
# TODO can this be generalized to save individual chunks of a dask array?
arr_refs = {'0': inlined_data}
@TomNicholas (Member, Author):

This would be cool because we could apply the encoding to every block in parallel (using map_blocks with #39), and that would allow the user to choose their chunking in the kerchunk output just by rechunking this variable using xarray before serializing it.
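
A rough, purely illustrative sketch of that idea (serial here, though each chunk could be handled in parallel): key each inlined entry by its chunk index instead of hard-coding '0':

import dask.array as da
import numpy as np

arr = da.arange(12, dtype="<i4", chunks=4)

# Illustrative only: one inlined reference per dask chunk, keyed "i.j.k" by
# chunk index. A real implementation would apply the variable's encoding to
# each chunk's bytes before storing them.
arr_refs = {}
for idx, block in zip(np.ndindex(*arr.numblocks), arr.to_delayed().ravel()):
    key = ".".join(str(i) for i in idx)
    arr_refs[key] = block.compute().tobytes()

print(list(arr_refs))  # ['0', '1', '2']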

@TomNicholas (Member, Author):

Also doing this for only some of the chunks would effectively implement case (2) of #62.

@TomNicholas (Member, Author)

> The time variable case is, I think, much harder - the encoded bytes look different to those that come out of kerchunk, presumably because kerchunk is doing some kind of CF-related encoding first. But looking inside kerchunk.hdf.SingleHdf5ToZarr I am none the wiser as to what it's doing that I would need to replicate.

So if I simply pass decode_times=False to xr.open_dataset internally then this test passes.
I think this implies that kerchunk is not actually doing any decoding of times at all?
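
i.e. something along these lines ("air.nc" is just a placeholder path), so the time variable keeps its raw encoded values rather than being decoded to datetimes:

import xarray as xr

# Skip CF time decoding so the raw encoded values are kept.
ds = xr.open_dataset("air.nc", decode_times=False)
print(ds["time"].dtype)               # numeric, e.g. float32, not datetime64
print(ds["time"].attrs.get("units"))  # e.g. "hours since 1800-01-01"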

@dcherian commented Apr 10, 2024

https://fsspec.github.io/kerchunk/reference.html#kerchunk.combine.MultiZarrToZarr

"cf:{var}", interpret the value of var using cftime, returning a datetime. These will be automatically re-encoded with cftime, unless you specify an "M8[*]" dtype for the coordinate, in which case a conversion will be attempted.
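
For example (an illustrative, untested sketch; the reference filenames are placeholders), that mapping is supplied via coo_map when combining references:

from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["refs_0.json", "refs_1.json"],
    concat_dims=["time"],
    coo_map={"time": "cf:time"},  # decode the time coordinate using cftime
)
combined_refs = mzz.translate()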

@TomNicholas (Member, Author)

I have no idea what this is:

>       expected = SingleHdf5ToZarr(netcdf4_file, spec=1, inline_threshold=500).translate()

virtualizarr/tests/test_integration.py:8: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/kerchunk/hdf.py:93: in __init__
    self._h5f = h5py.File(self.input_file, mode="r")
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/h5py/_hl/files.py:562: in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/h5py/_hl/files.py:235: in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   OSError: Unable to synchronously open file (file signature not found)

Given that the tests all pass for me locally, I'm not sure why this error only happens in CI.

@TomNicholas (Member, Author) commented May 3, 2024

So right now I can successfully inline (i.e. match kerchunk) the lat and lon variables in the xarray tutorial dataset (which have no encoding), but cannot serialize the air variable (which has scale_factor and offset encoding), nor the time variable (which has CF calendar and units encoding).

So I've parametrized the test to show what works and xfailed the ones that don't. I'm also now raising NotImplementedErrors when encoding is encountered.
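
Roughly along these lines (illustrative names, not the PR's exact test code):

import pytest

@pytest.mark.parametrize(
    "var",
    [
        "lat",
        "lon",
        pytest.param("air", marks=pytest.mark.xfail(reason="scale_factor/add_offset encoding")),
        pytest.param("time", marks=pytest.mark.xfail(reason="CF calendar/units encoding")),
    ],
)
def test_inlined_matches_kerchunk(var):
    # Hypothetical test body: compare our inlined references for `var`
    # against what kerchunk.hdf.SingleHdf5ToZarr produces.
    ...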

@TomNicholas (Member, Author)

@norlandrhagen do you have any idea what this error is? Or even how I might try to reproduce it locally?

OSError: Unable to synchronously open file (file signature not found)

@TomNicholas (Member, Author) commented May 15, 2024

@martindurant I'm seeing an error in a call to SingleHdf5ToZarr (which I'm calling for comparison purposes) - but it's only occurring in the CI. All my google results for this error say that it means the file is corrupted, but I don't understand why it would be corrupted when written out from xarray (and openable locally). Have you seen this before?

EDIT: Nevermind @norlandrhagen came to my rescue! ❤️

@norlandrhagen mentioned this pull request May 15, 2024
@TomNicholas merged commit 2d61d1a into main May 16, 2024 (5 checks passed)
@TomNicholas deleted the inline_chunkdata branch May 16, 2024 03:35