Description
Is your feature request related to a problem?
I am attempting to reduce the size of Datasets stored on disk and sent over the network while maintaining data accuracy. One compelling way to do this is integer storage, since the data source is natively in int16 format. However, the consumers of these Datasets expect data scaled into physical units, and that requires moving to a float64 data type and multiplying by a float scaling factor.
On the surface, this seems like a great fit for the scale_factor encoding parameter. The main issue I have is loss of accuracy. In order to use the scale_factor encoding parameter, the data manipulation involved looks like:
1. Convert data from `int16` to `float64`, multiply by the scale factor, and add the offset if needed.
2. Write to file using one of the `to_X` methods (e.g. `to_netcdf()`). The write operation applies the reverse scaling and casts to the target datatype, so `int16` data is stored.
3. The consumer then loads the data, using `mask_and_scale=True` (the default). On load, the data is cast to a float type, multiplied by `scale_factor`, and `add_offset` is added to it.
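The round trip above can be sketched with plain NumPy. This is a simplified model of CF scale/offset encoding, not xarray's actual implementation; the values and `scale_factor = 0.1` are illustrative:

```python
import numpy as np

# Illustrative values; the real data source is natively int16.
raw = np.arange(-1000, 1000, dtype=np.int16)
scale_factor = 0.1

# Step 1: producer converts to float64 and applies the scaling.
physical = raw.astype(np.float64) * scale_factor

# Step 2: on write, the library reverses the scaling and casts back
# to the target dtype, so int16 values are what land on disk.
# (A real implementation may truncate rather than round, which is one
# source of the accuracy loss discussed below.)
on_disk = np.round(physical / scale_factor).astype(np.int16)

# Step 3: the consumer's load re-applies the scaling (mask_and_scale=True).
loaded = on_disk.astype(np.float64) * scale_factor
```

With explicit rounding the values here survive the trip, but each pass through float arithmetic is an opportunity for error.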
If one could store scale-encoded data directly, steps 1 and 2 above would be unnecessary. Reducing the number of conversions/calculations on the source data would avoid unnecessary error from casting and float arithmetic.
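Two small, self-contained examples of the kind of error those extra operations can introduce (the specific constants are illustrative):

```python
import numpy as np

# Decimal scale factors are not exactly representable in binary floats,
# so each extra multiply/divide perturbs values by a few ULPs.
assert np.float64(3) * 0.1 != 0.3   # left side is 0.30000000000000004

# A truncating cast (astype-style, no rounding) can then land one
# integer low: 0.29 / 0.01 evaluates just below 29.
assert int(0.29 / 0.01) == 28
```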
For reference, I am using the netCDF4 format with h5netcdf engine, but I'm not sure if the engine or the file format matters with regard to scale encoding.
Describe the solution you'd like
Add a `mask_and_scale` argument to the `to_X` write methods (e.g. `to_netcdf()`), so a producer can provide pre-scale-encoded data along with the scale encoding parameters, and the load operation can apply the scaling when `mask_and_scale=True`. Essentially, if this proposed `mask_and_scale` argument were `True`, the library would assume the user has already scaled and masked their data whenever an encoding has been set. A basic check that the dtype matches the target encoding could serve as a sanity check.
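A minimal sketch of what that dtype sanity check could look like. `check_prescaled` and its signature are hypothetical, not existing xarray API:

```python
import numpy as np

def check_prescaled(data, encoding):
    """Hypothetical sanity check for a mask_and_scale=True write path.

    If the producer claims the data is already scale-encoded, the
    array's dtype should match the on-disk dtype in `encoding`.
    """
    target = np.dtype(encoding.get("dtype", data.dtype))
    if data.dtype != target:
        raise ValueError(
            f"pre-scaled data is {data.dtype}, but encoding targets {target}"
        )

# Usage: int16 data carrying scale metadata the consumer applies on load.
raw = np.arange(10, dtype=np.int16)
check_prescaled(raw, {"dtype": "int16", "scale_factor": 0.1})  # passes
```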
In concept, this seems very similar to the `mask_and_scale` argument to `load_dataset()`, but in the opposite direction.
Describe alternatives you've considered
- Using compression to reduce datastore size
- This is somewhat effective, but it has runtime costs on both the read and write sides of the operation. I will likely consider this in the near term.
- Storing the dataset directly as `int16` data and forcing the consumer to scale the data themselves.
 - I'd rather not push this onto the consumer's code, but it might be workable.
- Doing some math to "force" the rounding/casting outcome in the end to match the source data
- Seems error-prone and cumbersome, but may work in some situations.
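For the last alternative, one way to make the rounding/casting outcome deterministic is to choose a power-of-two scale factor, which binary floats represent exactly. A sketch (the factor `2**-4` is illustrative, and this only works when a power-of-two factor suits the data's physical units):

```python
import numpy as np

raw = np.arange(-1000, 1000, dtype=np.int16)
scale_factor = 2.0 ** -4  # exactly representable in binary floating point

# Multiplying/dividing by an exact power of two only changes the float's
# exponent, so the round trip back to int16 is bit-exact even with a
# truncating cast.
physical = raw.astype(np.float64) * scale_factor
restored = (physical / scale_factor).astype(np.int16)
```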
Additional context
No response