-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use magic bytes to identify file formats #143
Conversation
I would suggest considering just not trusting file extensions at all and only using the magics to identify file types. |
I took a stab at just using magic bytes in the last commit. Note that I changed the 'netcdf4' mapping to 'hdf5' since that is the format behind the scenes based on magic_bytes. Tests seem to pass locally, I haven't added new tests, but did try the follow 'test urls' that are publicly available without authentication: import fsspec
# https://en.wikipedia.org/wiki/List_of_file_signatures
examples = {
'grib':'https://github.com/pydata/xarray-data/raw/master/era5-2mt-2019-03-uk.grib',
'netcdf3':'https://github.com/pydata/xarray-data/raw/master/air_temperature.nc',
'netcdf4':'https://github.com/pydata/xarray-data/raw/master/ROMS_example.nc',
'hdf4':'https://github.com/corteva/rioxarray/raw/master/test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf',
# https://nisar.jpl.nasa.gov/data/sample-data/
'hdf5':'https://nisar.asf.earthdatacloud.nasa.gov/NISAR-SAMPLE-DATA/Soil_Moisture/ALOS-2/NISAR_L3_PR_SME2_001_008_D_070_4000_QPNA_A_20190829T180759_20190829T180809_P01101_M_P_J_001.h5',
'tif':'https://github.com/corteva/rioxarray/raw/master/test/test_data/input/cog.tif',
# https://github.com/astropy/astropy/blob/4d034aa7e27e31cb0241cc01bbe76eab47406a91/astropy/io/fits/tests/test_fsspec.py#L73
'fits':'https://mast.stsci.edu/api/v0.1/Download/file/?uri=mast:HST/product/ibxl50020_jif.fits',
'jpg': 'https://github.com/rasterio/rasterio/raw/main/tests/data/389225main_sw_1965_1024.jpg',
}
for file_type,url in examples.items():
with fsspec.open(url) as f:
magic_bytes = f.read(8)
print(file_type, magic_bytes)
if magic_bytes.startswith(b"CDF"):
print('netCDF3!')
elif magic_bytes.startswith(b"\x89HDF"):
print('HDF5 / netCDF4!')
elif magic_bytes.startswith(b"\x0e\x03\x13\x01"):
print('HDF4!')
elif magic_bytes.startswith(b'GRIB'):
print('GRIB!')
elif magic_bytes.startswith(b'II*'):
print('TIFF!')
elif magic_bytes.startswith(b'SIMPLE'):
print('FITS!')
else:
raise NotImplementedError(f"Unrecognised file based on header bytes: {magic_bytes}") |
This is neat, thanks @scottyhq ! I agree the magics are much more robust. That list of examples is cool - is there any point adding tests of opening those files to the test suite? |
I think it's useful. I could add a |
Yes nice. This test of reading from s3 could also use that marker. |
@@ -86,7 +88,7 @@ def read_kerchunk_references_from_file( | |||
|
|||
refs = NetCDF3ToZarr(filepath, inline_threshold=0, **reader_options).translate() | |||
|
|||
elif filetype.name.lower() == "netcdf4": | |||
elif filetype.name.lower() == "hdf5" or filetype.name.lower() == "netcdf4": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keeping netcdf4 as an alias for hdf5 because i've seen code in reported issues using filetype=netcdf4
@@ -325,6 +348,13 @@ def test_loadable_variables(self, netcdf4_file): | |||
if name in vars_to_load: | |||
xrt.assert_identical(vds.variables[name], full_ds.variables[name]) | |||
|
|||
def test_explicit_filetype(self, netcdf4_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure of best way to check these VDS are identical:
vds1 = open_virtual_dataset(netcdf4_file)
vds2 = open_virtual_dataset(netcdf4_file, filetype="netcdf4")
vds2 = open_virtual_dataset(netcdf4_file, filetype="hdf5")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you open with indexes={}
, does xrt.assert_identical
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem to work:
AssertionError: Left and right Dataset objects are not identical
Differing coordinates:
L ocean_time (ocean_time) float64 16B ManifestArray<shape=(2,), dtype=floa...
R ocean_time (ocean_time) float64 16B ManifestArray<shape=(2,), dtype=floa...
L s_rho (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
R s_rho (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
Differing data variables:
L lat_rho (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
R lat_rho (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
L Cs_r (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
R Cs_r (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
L h (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
R h (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
L lon_rho (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
R lon_rho (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
L hc float64 8B ManifestArray<shape=(), dtype=float64, chunks=()>
R hc float64 8B ManifestArray<shape=(), dtype=float64, chunks=()>
L salt (ocean_time, s_rho, eta_rho, xi_rho) float32 17MB ManifestArr...
R salt (ocean_time, s_rho, eta_rho, xi_rho) float32 17MB ManifestArr...
L zeta (ocean_time, eta_rho, xi_rho) float32 567kB ManifestArray<sha...
R zeta (ocean_time, eta_rho, xi_rho) float32 567kB ManifestArray<sha...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm okay - I've raised #161 to track that.
# https://github.com/zarr-developers/VirtualiZarr/issues/159 | ||
# "hdf5": "https://github.com/fsspec/kerchunk/raw/main/kerchunk/tests/NEONDSTowerTemperatureData.hdf5", | ||
# https://github.com/zarr-developers/VirtualiZarr/issues/160 | ||
# "tiff": "https://github.com/fsspec/kerchunk/raw/main/kerchunk/tests/lcmap_tiny_cog_2020.tif", | ||
# "fits": "https://fits.gsfc.nasa.gov/samples/WFPC2u5780205r_c0fx.fits", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worked on this a bit more @TomNicholas and I think it's ready to go. Added tests for all formats and linked to example files for all kerchunk-supported formats... But it seems like some additional work is needed to fully support TIFF, FITS, and HDF5. |
Thanks so much @scottyhq ! |
docs/releases.rst
api.rst