Merged
21 commits
14 changes: 11 additions & 3 deletions Changelog.rst
@@ -1,3 +1,11 @@
Version 0.6.0
-------------

**2025-09-XY**

* Enumeration Support (https://github.com/NCAS-CMS/pyfive/issues/85 by
`Bryan Lawrence <https://github.com/bnlawrence>`_)

Version 0.5.1
-------------

@@ -11,8 +19,8 @@ Version 0.5.1
iterating (https://github.com/NCAS-CMS/pyfive/pull/83 by `Kai
Mühlbauer <https://github.com/kmuehlbauer>`_)
* Add documentation for Pyfive
(https://github.com/NCAS-CMS/pyfive/pull/81 by `Valeriu Predoi
<https://github.com/valeriupredoi>`_)
(https://github.com/NCAS-CMS/pyfive/pull/81 by `Bryan Lawrence
<https://github.com/bnlawrence>`_)
* Setup documentation builds on Readthedocs
(https://github.com/NCAS-CMS/pyfive/pull/80 by `Valeriu Predoi
<https://github.com/valeriupredoi>`_)
@@ -46,7 +54,7 @@ Version 0.5.0
`Valeriu Predoi <https://github.com/valeriupredoi>`_)
* Functionality enhancements to address lazy loading of chunked data,
variable length strings, and other minor bug fixes
(https://github.com/NCAS-CMS/pyfive/pull/68 by `Brian Lawrence
(https://github.com/NCAS-CMS/pyfive/pull/68 by `Bryan Lawrence
<https://github.com/bnlawrence>`_)

----
17 changes: 17 additions & 0 deletions doc/api_reference.rst
@@ -22,3 +22,20 @@ API Reference
.. autoclass:: pyfive.Datatype
:members:
:noindex:

----

The h5t module
--------------

A partial implementation of some of the lower-level h5py API, needed
to support enumerations and variable-length strings.

.. autofunction:: pyfive.h5t.check_enum_dtype

.. autofunction:: pyfive.h5t.check_string_dtype

.. autofunction:: pyfive.h5t.check_dtype

.. autoclass:: pyfive.h5t.TypeEnumID

76 changes: 76 additions & 0 deletions doc/quickstart/enums.rst
@@ -0,0 +1,76 @@
Enumerations
------------

HDF5 has the concept of an enumeration data type, in which integer values are stored
in an array, but those integers should be interpreted as indexes into a set of string
values. So, for example, one could have an enumeration dictionary (`enum_dict`) defined as

.. code-block:: python

clouds = ['stratus', 'strato-cumulus', 'missing', 'nimbus', 'cumulus', 'longcloudname']
enum_dict = {v: k for k, v in enumerate(clouds)}
enum_dict['missing'] = 255

And an array of data which looks something like

.. code-block:: python

cloud_cover = [0,3,4,4,4,1,255,1,1]

Which one would expect to be interpreted as

.. code-block:: python

actual_cloud_cover = ['stratus', 'nimbus', 'cumulus', 'cumulus', 'cumulus',
                      'strato-cumulus', 'missing', 'strato-cumulus', 'strato-cumulus']
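Putting the pieces above together, the decoding can be sketched in plain Python (no HDF5 file involved; the values here are just the example data from above):

```python
clouds = ['stratus', 'strato-cumulus', 'missing', 'nimbus', 'cumulus', 'longcloudname']
enum_dict = {v: k for k, v in enumerate(clouds)}
enum_dict['missing'] = 255

# Reverse the mapping so each integer code indexes its name.
# Note that after the 'missing' override, code 2 no longer appears.
edict_reverse = {v: k for k, v in enum_dict.items()}

cloud_cover = [0, 3, 4, 4, 4, 1, 255, 1, 1]
actual_cloud_cover = [edict_reverse[k] for k in cloud_cover]
```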

These data are stored in HDF5 using a combination of an integer-valued
array and a stored dictionary which is used for the enumeration.
When the data is read, the integer array has a special numpy datatype, with
the enumeration dictionary stored as metadata on the data type.
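The dtype-metadata mechanism itself is plain numpy, so it can be sketched without any HDF5 file at all (the dictionary below is just an illustrative value, not one read from a file):

```python
import numpy as np

enum_dict = {'stratus': 0, 'strato-cumulus': 1, 'missing': 255}

# Attach the enumeration dictionary as metadata on an integer dtype,
# which is how h5py (and pyfive) tag enumeration variables.
dt = np.dtype('uint8', metadata={'enum': enum_dict})

data = np.array([0, 1, 255], dtype=dt)
# The dictionary travels with the dtype of the array.
enum_back = data.dtype.metadata['enum']
```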

The enumeration dictionary itself can be stored in the file as a ``Datatype``, but it
does not have to be, and it is not necessary to use that ``Datatype`` to
read an enumeration variable (the enumeration is not stored as a normal data
variable, and so can be stored without using a ``Datatype`` object in the file).
So, while finding a ``Datatype`` in your HDF5 file is probably an indication
that you have an enumeration (or some other complication) in the file,
if it is an enumeration datatype you do not need to do anything with it.

Whether or not there is an enumeration ``Datatype`` in the file, the only way to find
out whether an integer data array read from a data file is linked to an
enumeration is to check its data type using :meth:`pyfive.check_enum_dtype`, as shown
in the following example:

.. code-block:: python

with pyfive.File('myfile.h5') as pfile:

    evar = pfile['evar']
    edict = pyfive.check_enum_dtype(evar.dtype)
    if edict is None:
        pass  # not an enumeration
    else:
        # HDF5 defines the mapping with the string values as keys
        # to the integer indices, so reverse it for decoding.
        edict_reverse = {v: k for k, v in edict.items()}
        # assuming evar data is a one-dimensional array of integers
        edata = [edict_reverse[k] for k in evar[:]]

In this instance, `edata` would now be a list of strings looked up from the enumeration
dictionary, using the `evar` data as the index values.

(`h5py`, and hence `pyfive`, implement enumerations using an internal numpy dtype metadata feature.
Numpy is not clear on the future of this feature, and does not promise to propagate metadata
through all operations, so the output of operations on this integer array may lose the direct
link to the enumeration via the dtype. Meanwhile, as well as using `check_enum_dtype`, you can
also get to this dictionary directly yourself: it is available at ``evar.dtype.metadata['enum']``.)
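A minimal sketch of what `check_enum_dtype` does under this scheme (a hypothetical re-implementation for illustration, not pyfive's actual code):

```python
import numpy as np

def check_enum_dtype_sketch(dt):
    """Return the enum dictionary attached to a dtype, or None (hypothetical helper)."""
    if dt.metadata is None:
        return None
    return dt.metadata.get('enum')

plain = np.dtype('uint8')
tagged = np.dtype('uint8', metadata={'enum': {'a': 0, 'b': 1}})
```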

1 change: 1 addition & 0 deletions doc/quickstart/index.rst
@@ -6,4 +6,5 @@ Getting started

Installation <installation>
Usage <usage>
Enumerations <enums>

25 changes: 17 additions & 8 deletions pyfive/dataobjects.py
@@ -238,8 +238,13 @@ def _parse_attribute_msg(self, buffer, offset):

if shape == ():
value = value[0]
elif isinstance(value,dict):
pass
else:
value = value.reshape(shape)
try:
value = value.reshape(shape)
except AttributeError:
pass

return name, value

@@ -267,6 +272,8 @@ def _attr_value(self, dtype, buf, count, offset):
vlen, vlen_data = self._vlen_size_and_data(buf, offset)
value[i] = self._attr_value(base_dtype, vlen_data, vlen, 0)
offset += 16
elif dtype_class == 'ENUMERATION':
return np.dtype(dtype[1],metadata={'enum':dtype[2]})
else:
raise NotImplementedError
else:
@@ -335,11 +342,12 @@ def fillvalue(self):

if size:
if isinstance(self.dtype, tuple):
try:
assert self.dtype[0] == 'VLEN_STRING'
except:
raise ValueError('Unrecognised fill type')
fillvalue = self._attr_value(self.dtype, self.msg_data, 1, offset)[0]
if self.dtype[0] == 'VLEN_STRING':
fillvalue = self._attr_value(self.dtype, self.msg_data, 1, offset)[0]
elif self.dtype[0] in ['ENUMERATION']:
fillvalue = 0
else:
raise ValueError(f'Unrecognised dtype [{self.dtype}]')
else:
payload = self.msg_data[offset:offset+size]
fillvalue = np.frombuffer(payload, self.dtype, count=1)[0]
@@ -352,8 +360,9 @@ def dtype(self):
""" Datatype of the dataset. """
msg = self.find_msg_type(DATATYPE_MSG_TYPE)[0]
msg_offset = msg['offset_to_message']
return DatatypeMessage(self.msg_data, msg_offset).dtype

dtype = DatatypeMessage(self.msg_data, msg_offset).dtype
return dtype

@property
def chunks(self):
""" Tuple describing the chunk size, None if not chunked. """
49 changes: 45 additions & 4 deletions pyfive/datatype_msg.py
@@ -5,6 +5,8 @@
from .core import _padded_size, _structure_size, _unpack_struct_from
from .core import InvalidHDF5File

import numpy as np
import warnings

class DatatypeMessage(object):
""" Representation of a HDF5 Datatype Message. """
@@ -39,8 +41,7 @@ def determine_dtype(self):
elif datatype_class == DATATYPE_REFERENCE:
return ('REFERENCE', datatype_msg['size'])
elif datatype_class == DATATYPE_ENUMERATED:
raise NotImplementedError(
"Enumerated datatype class not supported.")
return self._determine_dtype_enum(datatype_msg)
elif datatype_class == DATATYPE_ARRAY:
raise NotImplementedError("Array datatype class not supported.")
elif datatype_class == DATATYPE_VARIABLE_LENGTH:
@@ -97,6 +98,7 @@ def _determine_dtype_floating_point(self, datatype_msg):

return byte_order_char + dtype_char + str(length_in_bytes)


@staticmethod
def _determine_dtype_string(datatype_msg):
""" Return the NumPy dtype for a string class. """
@@ -109,10 +111,12 @@ def _determine_dtype_compound(self, datatype_msg):
n_comp = bit_field_0 + (bit_field_1 << 4)

# read in the members of the compound datatype
# member names are null terminated, and padded out to a multiple of 8 bytes
members = []
for _ in range(n_comp):
null_location = self.buf.index(b'\x00', self.offset)
name_size = _padded_size(null_location - self.offset, 8)
name_size = _padded_size(null_location - self.offset + 1, 8)
name = self.buf[self.offset:self.offset+name_size]
name = name.strip(b'\x00').decode('utf-8')
self.offset += name_size
@@ -155,7 +159,7 @@ def _determine_dtype_compound(self, datatype_msg):
if names_valid and dtypes_valid and offsets_valid and props_valid:
return complex_dtype_map[dtype1]

raise NotImplementedError("Compond dtype not supported.")
raise NotImplementedError("Compound dtype not supported.")

@staticmethod
def _determine_dtype_vlen(datatype_msg):
@@ -167,6 +171,35 @@ def _determine_dtype_vlen(datatype_msg):
character_set = datatype_msg['class_bit_field_1'] & 0x01
return ('VLEN_STRING', padding_type, character_set)

def _determine_dtype_enum(self,datatype_msg):
""" Return the basetype and the underlying enum dictionary """
#FIXME: Consider overlap with the compound code, refactor in some way?
# Parsing the header explicitly here, rather than reusing the compound
# datatype approach, because that route is opaque and risky
enum_msg = _unpack_struct_from(ENUM_DATATYPE_MSG, self.buf, self.offset-DATATYPE_MSG_SIZE)
num_members = enum_msg['number_of_members']
value_size = enum_msg['size']
enum_keys = []
dtype = DatatypeMessage(self.buf,self.offset).dtype
self.offset+=12
# An extra 4 bytes are read as part of establishing the data type
# FIXME:ENUM Need to be sure that some other base type in the future
# wouldn't silently need more bytes and screw this all up. Should
# probably put some check/error handling around this.
# now get the keys
version = (datatype_msg['class_and_version'] >> 4) & 0x0F
for _ in range(num_members):
null_location = self.buf.index(b'\x00', self.offset)
name_size = null_location - self.offset + 1 if version == 3 else _padded_size(null_location - self.offset+ 1, 8)
name = self.buf[self.offset:self.offset+name_size]
name = name.strip(b'\x00').decode('ascii')
self.offset += name_size
enum_keys.append(name)
# now get the values
values = np.frombuffer(self.buf[self.offset:], dtype=dtype, count=num_members)
enum_dict = dict(zip(enum_keys, values))
return 'ENUMERATION', dtype, enum_dict


# IV.A.2.d The Datatype Message

@@ -177,8 +210,16 @@ def _determine_dtype_vlen(datatype_msg):
('class_bit_field_2', 'B'),
('size', 'I'),
))

DATATYPE_MSG_SIZE = _structure_size(DATATYPE_MSG)

ENUM_DATATYPE_MSG = OrderedDict((
('class_and_version', 'B'),
('number_of_members', 'H'), # 'H' is a 16-bit unsigned integer
('unused', 'B'),
('size', 'I'),
))
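The fixed-size header described by ``ENUM_DATATYPE_MSG`` can be unpacked with the standard ``struct`` module; a sketch assuming a little-endian packed layout (the byte order actually applied depends on pyfive's ``_unpack_struct_from`` helper, and the header bytes below are invented for illustration):

```python
import struct

# B (class_and_version), H (number_of_members), B (unused), I (size);
# little-endian with no padding gives 8 bytes in total.
fmt = '<BHBI'

# Hypothetical header: version 3 in the upper nibble, enumerated class (8)
# in the lower nibble, 6 members, 1-byte base values.
header = struct.pack(fmt, 0x38, 6, 0, 1)
class_and_version, number_of_members, unused, size = struct.unpack(fmt, header)
version = (class_and_version >> 4) & 0x0F
```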


COMPOUND_PROP_DESC_V1 = OrderedDict((
('offset', 'I'),
68 changes: 35 additions & 33 deletions pyfive/h5d.py
@@ -85,8 +85,10 @@ def __init__(self, dataobject, pseudo_chunking_size_MB=4):
self._unique = (self._filename, self.shape, self._msg_offset)

if isinstance(dataobject.dtype,tuple):
# this may not behave the same as h5py, do we care? #FIXME
self._dtype = dataobject.dtype
if dataobject.dtype[0] == 'ENUMERATION':
self._dtype = np.dtype(dataobject.dtype[1], metadata={'enum':dataobject.dtype[2]})
else:
self._dtype = dataobject.dtype
else:
self._dtype = np.dtype(dataobject.dtype)

@@ -312,34 +314,7 @@ def _build_index(self, dataobject):

def _get_contiguous_data(self, args):

if not isinstance(self._dtype, tuple):
if not self.posix:
# Not posix
return self._get_direct_from_contiguous(args)
else:
# posix
try:
# Create a memory-map to the stored array, which
# means that we will end up only copying the
# sub-array into in memory.
fh = self._fh
view = np.memmap(
fh,
dtype=self._dtype,
mode='c',
offset=self.data_offset,
shape=self.shape,
order=self._order
)
# Create the sub-array
result = view[args]
# Copy the data from disk to physical memory
result = result.view(type=np.ndarray)
fh.close()
return result
except UnsupportedOperation:
return self._get_direct_from_contiguous(args)
else:
if isinstance(self._dtype, tuple):
dtype_class = self._dtype[0]
if dtype_class == 'REFERENCE':
size = self._dtype[1]
@@ -371,6 +346,33 @@ def _get_contiguous_data(self, args):
else:
raise NotImplementedError(f'datatype not implemented - {dtype_class}')

if not self.posix:
# Not posix
return self._get_direct_from_contiguous(args)
else:
# posix
try:
# Create a memory-map to the stored array, which
# means that we will end up only copying the
# sub-array into in memory.
fh = self._fh
view = np.memmap(
fh,
dtype=self._dtype,
mode='c',
offset=self.data_offset,
shape=self.shape,
order=self._order
)
# Create the sub-array
result = view[args]
# Copy the data from disk to physical memory
result = result.view(type=np.ndarray)
fh.close()
return result
except UnsupportedOperation:
return self._get_direct_from_contiguous(args)


def _get_direct_from_contiguous(self, args=None):
"""
@@ -569,9 +571,9 @@ def _fh(self):

@property
def dtype(self):
if isinstance(self._dtype, tuple) and self._dtype[0] == 'VLEN_STRING':
return np.dtype("O")

if isinstance(self._dtype, tuple):
if self._dtype[0] == 'VLEN_STRING':
return np.dtype("O")
return self._dtype

