Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use magic bytes to identify file formats #143

Merged
merged 9 commits into from
Jun 26, 2024

Conversation

scottyhq
Copy link
Contributor

@scottyhq scottyhq commented Jun 14, 2024

@rabernat
Copy link
Collaborator

I would suggest considering just not trusting file extensions at all and only using the magics to identify file types.

@TomNicholas TomNicholas added the references generation Reading byte ranges from archival files label Jun 14, 2024
@scottyhq
Copy link
Contributor Author

scottyhq commented Jun 14, 2024

I took a stab at just using magic bytes in the last commit. Note that I changed the 'netcdf4' mapping to 'hdf5' since that is the format behind the scenes based on magic_bytes. Tests seem to pass locally, I haven't added new tests, but did try the follow 'test urls' that are publicly available without authentication:

import fsspec
# https://en.wikipedia.org/wiki/List_of_file_signatures
examples = {
    'grib':'https://github.com/pydata/xarray-data/raw/master/era5-2mt-2019-03-uk.grib',
    'netcdf3':'https://github.com/pydata/xarray-data/raw/master/air_temperature.nc', 
    'netcdf4':'https://github.com/pydata/xarray-data/raw/master/ROMS_example.nc',
    'hdf4':'https://github.com/corteva/rioxarray/raw/master/test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf',
     # https://nisar.jpl.nasa.gov/data/sample-data/
    'hdf5':'https://nisar.asf.earthdatacloud.nasa.gov/NISAR-SAMPLE-DATA/Soil_Moisture/ALOS-2/NISAR_L3_PR_SME2_001_008_D_070_4000_QPNA_A_20190829T180759_20190829T180809_P01101_M_P_J_001.h5',
    'tif':'https://github.com/corteva/rioxarray/raw/master/test/test_data/input/cog.tif',
     # https://github.com/astropy/astropy/blob/4d034aa7e27e31cb0241cc01bbe76eab47406a91/astropy/io/fits/tests/test_fsspec.py#L73
    'fits':'https://mast.stsci.edu/api/v0.1/Download/file/?uri=mast:HST/product/ibxl50020_jif.fits',
    'jpg': 'https://github.com/rasterio/rasterio/raw/main/tests/data/389225main_sw_1965_1024.jpg',

}

for file_type,url in examples.items():
    with fsspec.open(url) as f:
        magic_bytes = f.read(8)
        print(file_type, magic_bytes)
    
    if magic_bytes.startswith(b"CDF"):
        print('netCDF3!')
    elif magic_bytes.startswith(b"\x89HDF"):
        print('HDF5 / netCDF4!')
    elif magic_bytes.startswith(b"\x0e\x03\x13\x01"):
        print('HDF4!')
    elif magic_bytes.startswith(b'GRIB'):
        print('GRIB!')
    elif magic_bytes.startswith(b'II*'):
        print('TIFF!')
    elif magic_bytes.startswith(b'SIMPLE'):
        print('FITS!')
    else:
        raise NotImplementedError(f"Unrecognised file based on header bytes: {magic_bytes}")

@TomNicholas
Copy link
Member

This is neat, thanks @scottyhq !

I agree the magics are much more robust.

That list of examples is cool - is there any point adding tests of opening those files to the test suite?

@scottyhq
Copy link
Contributor Author

That list of examples is cool - is there any point adding tests of opening those files to the test suite?

I think it's useful. I could add a @network marker as Xarray does? Should have some time to do that later this week
https://github.com/pydata/xarray/blob/be8e17e4dc5da67d7cbb09db87d80c1bbc71a64e/conftest.py#L10

@TomNicholas
Copy link
Member

I could add a @network marker as Xarray does?

Yes nice. This test of reading from s3 could also use that marker.

@scottyhq scottyhq changed the title add more common file suffixes Use magic bytes to identify file formats Jun 25, 2024
@@ -86,7 +88,7 @@ def read_kerchunk_references_from_file(

refs = NetCDF3ToZarr(filepath, inline_threshold=0, **reader_options).translate()

elif filetype.name.lower() == "netcdf4":
elif filetype.name.lower() == "hdf5" or filetype.name.lower() == "netcdf4":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeping netcdf4 as an alias for hdf5 because i've seen code in reported issues using filetype=netcdf4

@@ -325,6 +348,13 @@ def test_loadable_variables(self, netcdf4_file):
if name in vars_to_load:
xrt.assert_identical(vds.variables[name], full_ds.variables[name])

def test_explicit_filetype(self, netcdf4_file):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure of best way to check these VDS are identical:

 vds1 = open_virtual_dataset(netcdf4_file) 
 vds2 = open_virtual_dataset(netcdf4_file, filetype="netcdf4") 
 vds2 = open_virtual_dataset(netcdf4_file, filetype="hdf5")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you open with indexes={}, does xrt.assert_identical work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem to work:

AssertionError: Left and right Dataset objects are not identical

Differing coordinates:
L   ocean_time  (ocean_time) float64 16B ManifestArray<shape=(2,), dtype=floa...
R   ocean_time  (ocean_time) float64 16B ManifestArray<shape=(2,), dtype=floa...
L   s_rho       (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
R   s_rho       (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
Differing data variables:
L   lat_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
R   lat_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
L   Cs_r        (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
R   Cs_r        (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
L   h           (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
R   h           (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
L   lon_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
R   lon_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
L   hc          float64 8B ManifestArray<shape=(), dtype=float64, chunks=()>
R   hc          float64 8B ManifestArray<shape=(), dtype=float64, chunks=()>
L   salt        (ocean_time, s_rho, eta_rho, xi_rho) float32 17MB ManifestArr...
R   salt        (ocean_time, s_rho, eta_rho, xi_rho) float32 17MB ManifestArr...
L   zeta        (ocean_time, eta_rho, xi_rho) float32 567kB ManifestArray<sha...
R   zeta        (ocean_time, eta_rho, xi_rho) float32 567kB ManifestArray<sha...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm okay - I've raised #161 to track that.

Comment on lines +319 to +323
# https://github.com/zarr-developers/VirtualiZarr/issues/159
# "hdf5": "https://github.com/fsspec/kerchunk/raw/main/kerchunk/tests/NEONDSTowerTemperatureData.hdf5",
# https://github.com/zarr-developers/VirtualiZarr/issues/160
# "tiff": "https://github.com/fsspec/kerchunk/raw/main/kerchunk/tests/lcmap_tiny_cog_2020.tif",
# "fits": "https://fits.gsfc.nasa.gov/samples/WFPC2u5780205r_c0fx.fits",
Copy link
Contributor Author

@scottyhq scottyhq Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting these for now, but thought I'd leave them to track #159 and #160

@scottyhq
Copy link
Contributor Author

Worked on this a bit more @TomNicholas and I think it's ready to go. Added tests for all formats and linked to example files for all kerchunk-supported formats... But it seems like some additional work is needed to fully support TIFF, FITS, and HDF5.

@TomNicholas TomNicholas merged commit 412c23c into zarr-developers:main Jun 26, 2024
7 checks passed
@TomNicholas
Copy link
Member

Thanks so much @scottyhq !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
references generation Reading byte ranges from archival files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NotImplementedError: Unrecognised file extension: .nc4 (and other common ones .h5, .hdf, .tif)
3 participants