Commit 6d93a95

jmccreight authored and dcherian committed
Handle the character array dim name (#2896)
* Handle the character array dim name in a variable's encoding, set in decode and reapply in encode
* Document char_dim_name
* Minor change to set of char_dim_name
* Test the roundtrip of the char_dim_name in encoding.
* pep8 or die
* Better test for char_dim_name
* pep8 79char madness
* nix test logic, use multiple parameterized vars
* When encoding and encoding, remove it from encoding
* Simpler is better
* pep8 visual indent complaint
* what is new!
* what is newer than new!
* what is newer than newer!
* what is newer than newer-er!
* what is newer than newer-est!
1 parent c8251e3 commit 6d93a95

4 files changed, +45 -12 lines changed

doc/io.rst

Lines changed: 17 additions & 10 deletions
@@ -302,16 +302,23 @@ to using encoded character arrays. Character arrays can be selected even for
 netCDF4 files by setting the ``dtype`` field in ``encoding`` to ``S1``
 (corresponding to NumPy's single-character bytes dtype).
 
-If character arrays are used, the string encoding that was used is stored on
-disk in the ``_Encoding`` attribute, which matches an ad-hoc convention
-`adopted by the netCDF4-Python library <https://github.com/Unidata/netcdf4-python/pull/665>`_.
-At the time of this writing (October 2017), a standard convention for indicating
-string encoding for character arrays in netCDF files was
-`still under discussion <https://github.com/Unidata/netcdf-c/issues/402>`_.
-Technically, you can use
-`any string encoding recognized by Python <https://docs.python.org/3/library/codecs.html#standard-encodings>`_ if you feel the need to deviate from UTF-8,
-by setting the ``_Encoding`` field in ``encoding``. But
-`we don't recommend it <http://utf8everywhere.org/>`_.
+If character arrays are used:
+
+- The string encoding that was used is stored on
+  disk in the ``_Encoding`` attribute, which matches an ad-hoc convention
+  `adopted by the netCDF4-Python library <https://github.com/Unidata/netcdf4-python/pull/665>`_.
+  At the time of this writing (October 2017), a standard convention for indicating
+  string encoding for character arrays in netCDF files was
+  `still under discussion <https://github.com/Unidata/netcdf-c/issues/402>`_.
+  Technically, you can use
+  `any string encoding recognized by Python <https://docs.python.org/3/library/codecs.html#standard-encodings>`_ if you feel the need to deviate from UTF-8,
+  by setting the ``_Encoding`` field in ``encoding``. But
+  `we don't recommend it <http://utf8everywhere.org/>`_.
+- The character dimension name can be specified by the ``char_dim_name`` field of a variable's
+  ``encoding``. If this is not specified, the default name for the character dimension is
+  ``'string%s' % data.shape[-1]``. When decoding character arrays from existing files, the
+  ``char_dim_name`` is added to the variable's ``encoding`` so that it is preserved if the variable is re-encoded, but
+  the field can be edited by the user.
 
 .. warning::
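The naming rule documented in the second bullet can be illustrated with a small dependency-free sketch. `resolve_char_dim_name` is a hypothetical helper written for this note, not an xarray function, and plain Python `bytes` objects stand in for a NumPy array: the width of the widest string plays the role of `data.shape[-1]`.

```python
def resolve_char_dim_name(encoding, char_width):
    """Pick the name of the trailing character dimension.

    Hypothetical helper (not part of xarray): uses the user-supplied
    'char_dim_name' if present, else the documented default
    'string%s' % char_width.
    """
    # dict.get() leaves the encoding dict untouched.
    return encoding.get('char_dim_name', 'string%s' % char_width)


names = [b'ab', b'cdef']
# Width of the widest string; stands in for data.shape[-1].
char_width = max(len(name) for name in names)

default_name = resolve_char_dim_name({}, char_width)
custom_name = resolve_char_dim_name({'char_dim_name': 'name_strlen'},
                                    char_width)
```

With no `char_dim_name` in the encoding the name falls back to the width-based default; with one set, the user's choice wins.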

doc/whats-new.rst

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,10 @@ v0.12.2 (unreleased)
 Enhancements
 ~~~~~~~~~~~~
 
+- Character arrays' character dimension name decoding and encoding handled by
+  ``var.encoding['char_dim_name']`` (:issue:`2895`)
+  By `James McCreight <https://github.com/jmccreight>`_.
+
 Bug fixes
 ~~~~~~~~~

xarray/coding/strings.py

Lines changed: 6 additions & 2 deletions
@@ -103,16 +103,20 @@ def encode(self, variable, name=None):
         dims, data, attrs, encoding = unpack_for_encoding(variable)
         if data.dtype.kind == 'S' and encoding.get('dtype') is not str:
             data = bytes_to_char(data)
-            dims = dims + ('string%s' % data.shape[-1],)
+            if 'char_dim_name' in encoding.keys():
+                char_dim_name = encoding.pop('char_dim_name')
+            else:
+                char_dim_name = 'string%s' % data.shape[-1]
+            dims = dims + (char_dim_name,)
         return Variable(dims, data, attrs, encoding)
 
     def decode(self, variable, name=None):
         dims, data, attrs, encoding = unpack_for_decoding(variable)
 
         if data.dtype == 'S1' and dims:
+            encoding['char_dim_name'] = dims[-1]
             dims = dims[:-1]
             data = char_to_bytes(data)
-
         return Variable(dims, data, attrs, encoding)
 
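The effect of this change on a roundtrip can be sketched without xarray or NumPy. The functions below are simplified stand-ins written for illustration, not the library's `bytes_to_char` / `char_to_bytes`: tuples, lists and dicts replace `Variable`, and null-byte padding is assumed to mimic fixed-width byte strings.

```python
def encode_chars(dims, data, encoding):
    """Split byte strings into single characters along a new last dim."""
    encoding = dict(encoding)  # copy so the caller's dict is not mutated
    width = max(len(item) for item in data)
    # Pad each string to the common width, then cut it into 1-byte pieces.
    chars = [[item.ljust(width, b'\x00')[i:i + 1] for i in range(width)]
             for item in data]
    # pop() removes the key, so it does not linger in the encoded encoding.
    char_dim_name = encoding.pop('char_dim_name', 'string%s' % width)
    return dims + (char_dim_name,), chars, encoding


def decode_chars(dims, chars, encoding):
    """Join characters back into byte strings and remember the dim name."""
    encoding = dict(encoding)
    encoding['char_dim_name'] = dims[-1]  # preserved for a later encode
    data = [b''.join(row).rstrip(b'\x00') for row in chars]
    return dims[:-1], data, encoding


enc_dims, enc_chars, enc_encoding = encode_chars(
    ('x',), [b'ab', b'cdef'], {'char_dim_name': 'foo'})
rt_dims, rt_data, rt_encoding = decode_chars(
    enc_dims, enc_chars, enc_encoding)
```

Encoding consumes `char_dim_name` and decoding restores it from the last dimension, which is why a file read and written again keeps its original character dimension name.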

xarray/tests/test_coding_strings.py

Lines changed: 18 additions & 0 deletions
@@ -107,6 +107,24 @@ def test_CharacterArrayCoder_encode(data):
     assert_identical(actual, expected)
 
 
+@pytest.mark.parametrize(
+    ['original', 'expected_char_dim_name'],
+    [
+        (Variable(('x',), [b'ab', b'cdef']),
+         'string4'),
+        (Variable(('x',), [b'ab', b'cdef'], encoding={'char_dim_name': 'foo'}),
+         'foo')
+    ]
+)
+def test_CharacterArrayCoder_char_dim_name(original, expected_char_dim_name):
+    coder = strings.CharacterArrayCoder()
+    encoded = coder.encode(original)
+    roundtripped = coder.decode(encoded)
+    assert encoded.dims[-1] == expected_char_dim_name
+    assert roundtripped.encoding['char_dim_name'] == expected_char_dim_name
+    assert roundtripped.dims[-1] == original.dims[-1]
+
+
 def test_StackedBytesArray():
     array = np.array([[b'a', b'b', b'c'], [b'd', b'e', b'f']], dtype='S')
     actual = strings.StackedBytesArray(array)
