to_netcdf() to automatically switch to fixed-length strings for compressed variables  #2040

Closed
@crusaderky

Description

When a dataset contains numpy arrays of fixed-length unicode strings (dtype <U...) and you call to_netcdf() without specifying an encoding, they are automatically stored as variable-length strings unless you explicitly specify {'dtype': 'S1'}.

Is this in order to save disk space in case strings vary wildly in size? I may be able to see the point in this case.
However, this default is disastrous when variables are compressed: any compression algorithm will reduce the zero-padding at the end of fixed-length strings to a negligible size, so fixed-length storage costs almost nothing extra.
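To illustrate the padding argument in isolation, here is a minimal sketch using only the standard library. It uses synthetic random text, not the dataset from this report, and compresses a plain byte buffer with zlib rather than going through netCDF, so it only shows how cheaply trailing NUL padding compresses. The record width and record count are arbitrary assumptions.

```python
import random
import zlib

random.seed(0)  # deterministic synthetic data

WIDTH = 100   # assumed fixed record width, in bytes
N = 2000      # assumed number of records

# Random ASCII "strings" of 10 to 100 characters (a stand-in for real text).
alphabet = "abcdefghijklmnopqrstuvwxyz "
records = [
    "".join(random.choices(alphabet, k=random.randint(10, WIDTH))).encode("ascii")
    for _ in range(N)
]

# Fixed-length layout: every record is NUL-padded to WIDTH bytes.
padded = b"".join(r.ljust(WIDTH, b"\x00") for r in records)
# The same text with no padding at all, for comparison.
unpadded = b"".join(records)

padded_compressed = len(zlib.compress(padded))
unpadded_compressed = len(zlib.compress(unpadded))

print("padded, raw:         ", len(padded))
print("padded, compressed:  ", padded_compressed)
print("unpadded, raw:       ", len(unpadded))
print("unpadded, compressed:", unpadded_compressed)
```

Comparing the two compressed sizes shows how much of the compressed output is attributable to the padding itself.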

My test data: a dataset with ~50 variables, of which half are strings of 10 to 100 English characters and the other half are floats, all on a single dimension with 12k points.

Test 1:

ds.to_netcdf('uncompressed.nc')

Result: 45MB

Test 2:

encoding = {k: {'zlib': True, 'shuffle': True} for k in ds.variables}
ds.to_netcdf('bad-compression.nc', encoding=encoding)

Result: 42MB

Test 3:

encoding = {}
for k, v in ds.variables.items():
    encoding[k] = {'zlib': True, 'shuffle': True}
    # Store unicode variables as fixed-length character arrays
    if v.dtype.kind == 'U':
        encoding[k]['dtype'] = 'S1'
ds.to_netcdf('good-compression.nc', encoding=encoding)

Result: 5MB

Proposal

For string variables with no explicitly defined dtype, to_netcdf() should choose the dtype dynamically: S1 if compression is enabled, str otherwise.
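Until something like this exists in to_netcdf(), the same rule can be applied by hand. The sketch below is a user-side workaround, not xarray's implementation: build_encoding is a hypothetical helper name, and it assumes the `zlib` and `shuffle` encoding keys that xarray documents for its netCDF4 backend. It only relies on each variable exposing a numpy-style `.dtype`, so it accepts `ds.variables` directly.

```python
def build_encoding(variables, compress):
    """Build a to_netcdf() encoding dict following the rule proposed above.

    variables: mapping of name -> object with a numpy-style ``.dtype``
               (for example ``ds.variables``).
    compress:  whether compression should be enabled for every variable.

    Hypothetical helper for illustration; not part of xarray.
    """
    encoding = {}
    for name, var in variables.items():
        enc = {}
        if compress:
            enc["zlib"] = True
            enc["shuffle"] = True
            # Unicode variables: store as fixed-length character arrays so
            # the compressor can squeeze out the trailing padding.
            if var.dtype.kind == "U":
                enc["dtype"] = "S1"
        # Without compression, leave dtype unset so the current default
        # (variable-length strings) applies.
        encoding[name] = enc
    return encoding
```

Usage would look like ds.to_netcdf('out.nc', encoding=build_encoding(ds.variables, compress=True)). This version overrides nothing the user set, because it builds the dict from scratch; merging with a user-supplied encoding is left out to keep the sketch short.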
