Description
When a dataset contains fixed-length numpy arrays of unicode characters (dtype <U...) and you invoke to_netcdf() without any particular encoding, they are automatically stored as variable-length strings, unless you explicitly specify {'dtype': 'S1'}.
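For example (a minimal sketch; the variable name and values here are made up):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'name': ('x', np.array(['foo', 'barbaz'], dtype='<U6'))})

# Default: 'name' is written as a variable-length string variable
ds.to_netcdf('default.nc')

# Explicit override: 'name' is written as a fixed-length char array
ds.to_netcdf('fixed.nc', encoding={'name': {'dtype': 'S1'}})
```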
Is this meant to save disk space when string lengths vary wildly? If so, I can see the point.
However, this approach is disastrous if variables are compressed: HDF5 does not actually compress the contents of variable-length strings, whereas with fixed-length strings any compression algorithm will reduce the zero-padding at the end of the strings to a negligible size.
My test data: a dataset with ~50 variables, half of them strings of 10 to 100 English characters and the other half floats, all on a single dimension with 12k points.
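A minimal sketch that builds a dataset of this shape (the variable names, random contents, and seed are my own assumptions, not the original data):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
n = 12_000
letters = np.array(list('abcdefghijklmnopqrstuvwxyz'))

def random_strings(n, low=10, high=100):
    # Builds a fixed-length unicode array (dtype <U...) of random words
    return np.array([''.join(rng.choice(letters, size=rng.integers(low, high + 1)))
                     for _ in range(n)])

data = {}
for i in range(25):
    data[f'str_{i}'] = ('points', random_strings(n))
    data[f'flt_{i}'] = ('points', rng.random(n))
ds = xr.Dataset(data)
```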
Test 1:
```python
ds.to_netcdf('uncompressed.nc')
```
Result: 45MB
Test 2:
```python
encoding = {k: {'zlib': True, 'shuffle': True} for k in ds.variables}
ds.to_netcdf('bad-compression.nc', encoding=encoding)
```
Result: 42MB
Test 3:
```python
encoding = {}
for k, v in ds.variables.items():
    encoding[k] = {'zlib': True, 'shuffle': True}
    if v.dtype.kind == 'U':
        encoding[k]['dtype'] = 'S1'
ds.to_netcdf('good-compression.nc', encoding=encoding)
```
Result: 5MB
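To double-check what actually ended up on disk, something like this (assuming the netCDF4-python library) prints each variable's stored dtype and active compression filters; variable-length strings show up with dtype str, fixed-length ones as S1:

```python
import netCDF4

for path in ['bad-compression.nc', 'good-compression.nc']:
    with netCDF4.Dataset(path) as nc:
        for name, var in nc.variables.items():
            print(path, name, var.dtype, var.filters())
```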
Proposal
For string variables whose encoding does not explicitly define a dtype, to_netcdf() should pick one dynamically: S1 if compression is enabled, variable-length str if it is disabled.
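Roughly, the selection logic could look like this (a hypothetical helper for illustration only, not actual xarray code):

```python
def choose_string_dtype(var, encoding):
    # Only applies to unicode variables with no user-specified dtype
    if var.dtype.kind == 'U' and 'dtype' not in encoding:
        # Fixed-length chars compress well; vlen strings do not compress
        return 'S1' if encoding.get('zlib') else str
    return encoding.get('dtype')
```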