Allow writing pre-masked and pre-scaled DataSet to Store #11207

@PSquaredX

Description

Is your feature request related to a problem?

I am attempting to reduce the size of Datasets stored on disk and sent over the network while maintaining data accuracy. One compelling way to do this is integer storage, since the data source is natively in int16 format. However, the consumers of these Datasets expect data scaled into physical units, which requires converting to a float64 data type and multiplying by a float scaling factor.

On the surface, this seems like a great fit for the scale_factor encoding parameter. The main issue I have is loss of accuracy. In order to use the scale_factor encoding parameter, the data manipulation involved looks like:

  1. Convert data from int16 to float64, multiply by the scale factor, and add offset if needed.
  2. Write to file using one of the to_X methods (e.g. to_netcdf()).
    • The write-file operation applies the reverse scaling and casts to the target datatype, so int16 data is stored.
  3. Consumer then loads data, using mask_and_scale=True (default).
    • When the data is loaded, it is cast to a float type, multiplied by scale_factor, and add_offset is added to it.

If one could store scale-encoded data directly, steps 1 and 2 above would be unnecessary. Reducing the number of conversions/calculations on the source data would avoid unnecessary error from casting and float arithmetic.
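The round trip in steps 1 and 2 can be sketched in plain numpy (a minimal illustration, not xarray's actual implementation; the scale factor and values are made up). Note that the reverse step needs an explicit round before the cast, because a quotient that lands fractionally below an integer would otherwise truncate downward — exactly the kind of off-by-one casting error this request is about:

```python
import numpy as np

scale_factor = 0.01  # illustrative encoding parameter
raw = np.array([-1234, 0, 1, 30000], dtype=np.int16)  # native int16 source

# Step 1: producer casts to float64 and applies the scale factor.
physical = raw.astype(np.float64) * scale_factor

# Step 2 (write path): reverse the scaling and cast back to int16.
# Without np.round, a quotient like 299.99999999 would truncate to 299;
# rounding first recovers the original integers despite float error.
restored = np.round(physical / scale_factor).astype(np.int16)

assert np.array_equal(restored, raw)
```

Storing the pre-encoded int16 values directly would skip both float conversions entirely, so this rounding concern would never arise.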

For reference, I am using the netCDF4 format with h5netcdf engine, but I'm not sure if the engine or the file format matters with regard to scale encoding.

Describe the solution you'd like

Add a mask_and_scale argument to the to_X write methods (e.g. to_netcdf()), so a producer can provide pre-scale-encoded data along with the scale encoding parameters, and the load operation can still apply the scaling if mask_and_scale=True. Essentially, if this proposed mask_and_scale argument were True, the library would assume the user has already pre-scaled and pre-masked their data whenever encoding has been set. Perhaps a basic check that the dtype matches the target encoding could serve as a sanity check.

In concept, this seems very similar to the mask_and_scale argument to load_dataset(), but in the opposite direction.
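A sketch of what the proposed write path might look like (the mask_and_scale argument on to_netcdf() does not exist today; this is purely illustrative pseudocode of the request, and the variable names are made up):

```python
import numpy as np
import xarray as xr

# Producer already holds pre-encoded int16 data:
# physical value = stored_int * scale_factor
ds = xr.Dataset({"temp": ("x", np.array([100, 200, 300], dtype=np.int16))})
ds["temp"].encoding = {"dtype": "int16", "scale_factor": 0.01}

# Proposed: tell the writer the data is already masked/scaled, so it
# stores the int16 values as-is instead of reverse-scaling them first.
ds.to_netcdf("out.nc", mask_and_scale=True)  # hypothetical argument

# The consumer path is unchanged: with mask_and_scale=True (the default),
# open_dataset applies scale_factor on load and yields float64 values.
loaded = xr.open_dataset("out.nc")
```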

Describe alternatives you've considered

  1. Using compression to reduce datastore size
    • This is somewhat effective but has runtime consequences on both the load and write sides of the operation. I will likely consider this in the near term.
  2. Storing the dataset directly as int16 data, and forcing the consumer to scale the data themselves.
    • I'd rather not push this onto the consumer's code, but it might be workable.
  3. Doing some math to "force" the rounding/casting outcome in the end to match the source data
    • Seems error-prone and cumbersome, but may work in some situations.

Additional context

No response
