Merged
21 commits
14 changes: 11 additions & 3 deletions Changelog.rst
@@ -1,3 +1,11 @@
Version 0.6.0
-------------

**2025-09-XY**

* Enumeration Support (https://github.com/NCAS-CMS/pyfive/issues/85 by
`Bryan Lawrence <https://github.com/bnlawrence>`_)

Version 0.5.1
-------------

@@ -11,8 +19,8 @@ Version 0.5.1
iterating (https://github.com/NCAS-CMS/pyfive/pull/83 by `Kai
Mühlbauer <https://github.com/kmuehlbauer>`_)
* Add documentation for Pyfive
(https://github.com/NCAS-CMS/pyfive/pull/81 by `Valeriu Predoi
<https://github.com/valeriupredoi>`_)
(https://github.com/NCAS-CMS/pyfive/pull/81 by `Bryan Lawrence
<https://github.com/bnlawrence>`_)
* Setup documentation builds on Readthedocs
(https://github.com/NCAS-CMS/pyfive/pull/80 by `Valeriu Predoi
<https://github.com/valeriupredoi>`_)
@@ -46,7 +54,7 @@ Version 0.5.0
`Valeriu Predoi <https://github.com/valeriupredoi>`_)
* Functionality enhancements to address lazy loading of chunked data,
variable length strings, and other minor bug fixes
(https://github.com/NCAS-CMS/pyfive/pull/68 by `Brian Lawrence
(https://github.com/NCAS-CMS/pyfive/pull/68 by `Bryan Lawrence
<https://github.com/bnlawrence>`_)

----
17 changes: 17 additions & 0 deletions doc/api_reference.rst
@@ -22,3 +22,20 @@ API Reference
.. autoclass:: pyfive.Datatype
:members:
:noindex:

----

The h5t module
--------------

A partial implementation of some of the lower-level h5py API, needed
to support enumerations and variable-length strings.

.. autofunction:: pyfive.h5t.check_enum_dtype

.. autofunction:: pyfive.h5t.check_string_dtype

.. autofunction:: pyfive.h5t.check_dtype

.. autoclass:: pyfive.h5t.TypeEnumID

76 changes: 76 additions & 0 deletions doc/quickstart/enums.rst
@@ -0,0 +1,76 @@
Enumerations
------------

HDF5 has the concept of an enumeration data type, in which integer values are stored
in an array, but those integers should be interpreted as indexes into a set of string
values. So, for example, one could have an enumeration dictionary (`enum_dict`) defined as

.. code-block:: python

clouds = ['stratus', 'strato-cumulus', 'missing', 'nimbus', 'cumulus', 'longcloudname']
enum_dict = {v: k for k, v in enumerate(clouds)}
enum_dict['missing'] = 255

And an array of data which looks something like

.. code-block:: python

cloud_cover = [0,3,4,4,4,1,255,1,1]

Which one would expect to be interpreted as

.. code-block:: python

actual_cloud_cover = ['stratus', 'nimbus', 'cumulus', 'cumulus', 'cumulus',
                      'strato-cumulus', 'missing', 'strato-cumulus', 'strato-cumulus']
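Putting the pieces above together, the decoding can be sketched in plain Python (no HDF5 file involved; the values here are just the example data from above):

```python
clouds = ['stratus', 'strato-cumulus', 'missing', 'nimbus', 'cumulus', 'longcloudname']
enum_dict = {v: k for k, v in enumerate(clouds)}
enum_dict['missing'] = 255

# Reverse the mapping so each integer code indexes its name.
# Note that after the 'missing' override, code 2 no longer appears.
edict_reverse = {v: k for k, v in enum_dict.items()}

cloud_cover = [0, 3, 4, 4, 4, 1, 255, 1, 1]
actual_cloud_cover = [edict_reverse[k] for k in cloud_cover]
```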

These data are stored in HDF5 using a combination of an integer-valued
array and a stored dictionary which is used for the enumeration.
When the data is read, the integer array has a special numpy datatype, with
the enumeration dictionary stored as metadata on the data type.
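The dtype-metadata mechanism itself is plain numpy, so it can be sketched without any HDF5 file at all (the dictionary below is just an illustrative value, not one read from a file):

```python
import numpy as np

enum_dict = {'stratus': 0, 'strato-cumulus': 1, 'missing': 255}

# Attach the enumeration dictionary as metadata on an integer dtype,
# which is how h5py (and pyfive) tag enumeration variables.
dt = np.dtype('uint8', metadata={'enum': enum_dict})

data = np.array([0, 1, 255], dtype=dt)
# The dictionary travels with the dtype of the array.
enum_back = data.dtype.metadata['enum']
```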

The enumeration dictionary itself can be stored in the file as a ``Datatype``, but it
does not have to be, and it is not necessary to use that ``Datatype`` to
read an enumeration variable (the enumeration is not stored as a normal data
variable, and so can be stored without using a ``Datatype`` object in the file).
So, while finding a ``Datatype`` in your HDF5 file is probably an indication
that you have an enumeration (or some other complication) in the file,
if it is an enumeration datatype you do not need to do anything with it.

Whether or not there is an enumeration ``Datatype`` in the file, the only way to find
out whether an integer data array read from a data file is linked to an
enumeration is to check its data type using :meth:`pyfive.check_enum_dtype`, as shown
in the following example:

.. code-block:: python

with pyfive.File('myfile.h5') as pfile:

    evar = pfile['evar']
    edict = pyfive.check_enum_dtype(evar.dtype)
    if edict is None:
        pass  # not an enumeration
    else:
        # HDF5 defines the mapping with the string values as keys
        # to the integer indices, so reverse it for decoding.
        edict_reverse = {v: k for k, v in edict.items()}
        # assuming evar data is a one-dimensional array of integers
        edata = [edict_reverse[k] for k in evar[:]]

In this instance, `edata` would now be a list of strings looked up from the enumeration
dictionary, using the `evar` data as the index values.

(`h5py`, and hence `pyfive`, implement enumerations using an internal numpy dtype metadata feature.
Numpy is not clear on the future of this feature, and does not promise to propagate metadata
through all operations, so the output of operations on this integer array may lose the direct
link to the enumeration via the dtype. Meanwhile, as well as using `check_enum_dtype`, you can
also get to this dictionary directly yourself: it is available at ``evar.dtype.metadata['enum']``.)
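A minimal sketch of what `check_enum_dtype` does under this scheme (a hypothetical re-implementation for illustration, not pyfive's actual code):

```python
import numpy as np

def check_enum_dtype_sketch(dt):
    """Return the enum dictionary attached to a dtype, or None (hypothetical helper)."""
    if dt.metadata is None:
        return None
    return dt.metadata.get('enum')

plain = np.dtype('uint8')
tagged = np.dtype('uint8', metadata={'enum': {'a': 0, 'b': 1}})
```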

1 change: 1 addition & 0 deletions doc/quickstart/index.rst
@@ -6,4 +6,5 @@ Getting started

Installation <installation>
Usage <usage>
Enumerations <enums>

25 changes: 17 additions & 8 deletions pyfive/dataobjects.py
@@ -238,8 +238,13 @@ def _parse_attribute_msg(self, buffer, offset):

if shape == ():
value = value[0]
elif isinstance(value,dict):
pass
else:
value = value.reshape(shape)
try:
value = value.reshape(shape)
except AttributeError:
pass

return name, value

@@ -267,6 +272,8 @@ def _attr_value(self, dtype, buf, count, offset):
vlen, vlen_data = self._vlen_size_and_data(buf, offset)
value[i] = self._attr_value(base_dtype, vlen_data, vlen, 0)
offset += 16
elif dtype_class == 'ENUMERATION':
return np.dtype(dtype[1],metadata={'enum':dtype[2]})
else:
raise NotImplementedError
else:
@@ -335,11 +342,12 @@ def fillvalue(self):

if size:
if isinstance(self.dtype, tuple):
try:
assert self.dtype[0] == 'VLEN_STRING'
except:
raise ValueError('Unrecognised fill type')
fillvalue = self._attr_value(self.dtype, self.msg_data, 1, offset)[0]
if self.dtype[0] == 'VLEN_STRING':
fillvalue = self._attr_value(self.dtype, self.msg_data, 1, offset)[0]
elif self.dtype[0] in ['ENUMERATION']:
fillvalue = 0
else:
raise ValueError(f'Unrecognised dtype [{self.dtype}]')
else:
payload = self.msg_data[offset:offset+size]
fillvalue = np.frombuffer(payload, self.dtype, count=1)[0]
@@ -352,8 +360,9 @@ def dtype(self):
""" Datatype of the dataset. """
msg = self.find_msg_type(DATATYPE_MSG_TYPE)[0]
msg_offset = msg['offset_to_message']
return DatatypeMessage(self.msg_data, msg_offset).dtype

dtype = DatatypeMessage(self.msg_data, msg_offset).dtype
return dtype

@property
def chunks(self):
""" Tuple describing the chunk size, None if not chunked. """
49 changes: 45 additions & 4 deletions pyfive/datatype_msg.py
@@ -5,6 +5,8 @@
from .core import _padded_size, _structure_size, _unpack_struct_from
from .core import InvalidHDF5File

import numpy as np
import warnings

class DatatypeMessage(object):
""" Representation of a HDF5 Datatype Message. """
@@ -39,8 +41,7 @@ def determine_dtype(self):
elif datatype_class == DATATYPE_REFERENCE:
return ('REFERENCE', datatype_msg['size'])
elif datatype_class == DATATYPE_ENUMERATED:
raise NotImplementedError(
"Enumerated datatype class not supported.")
return self._determine_dtype_enum(datatype_msg)
elif datatype_class == DATATYPE_ARRAY:
raise NotImplementedError("Array datatype class not supported.")
elif datatype_class == DATATYPE_VARIABLE_LENGTH:
@@ -97,6 +98,7 @@ def _determine_dtype_floating_point(self, datatype_msg):

return byte_order_char + dtype_char + str(length_in_bytes)


@staticmethod
def _determine_dtype_string(datatype_msg):
""" Return the NumPy dtype for a string class. """
@@ -109,10 +111,12 @@ def _determine_dtype_compound(self, datatype_msg):
n_comp = bit_field_0 + (bit_field_1 << 4)

# read in the members of the compound datatype
# member names are null terminated, and padded out to a multiple of 8 bytes
members = []
for _ in range(n_comp):
null_location = self.buf.index(b'\x00', self.offset)
name_size = _padded_size(null_location - self.offset, 8)
name_size = _padded_size(null_location - self.offset + 1, 8)
name = self.buf[self.offset:self.offset+name_size]
name = name.strip(b'\x00').decode('utf-8')
self.offset += name_size
@@ -155,7 +159,7 @@ def _determine_dtype_compound(self, datatype_msg):
if names_valid and dtypes_valid and offsets_valid and props_valid:
return complex_dtype_map[dtype1]

raise NotImplementedError("Compond dtype not supported.")
raise NotImplementedError("Compound dtype not supported.")

@staticmethod
def _determine_dtype_vlen(datatype_msg):
@@ -167,6 +171,35 @@ def _determine_dtype_vlen(datatype_msg):
character_set = datatype_msg['class_bit_field_1'] & 0x01
return ('VLEN_STRING', padding_type, character_set)

def _determine_dtype_enum(self,datatype_msg):
""" Return the basetype and the underlying enum dictionary """
#FIXME: Consider overlap with the compound code, refactor in some way?
# Parsing the header explicitly here, rather than reusing the compound
# datatype approach, because that route is opaque and risky
enum_msg = _unpack_struct_from(ENUM_DATATYPE_MSG, self.buf, self.offset-DATATYPE_MSG_SIZE)
num_members = enum_msg['number_of_members']
value_size = enum_msg['size']
enum_keys = []
dtype = DatatypeMessage(self.buf,self.offset).dtype
self.offset+=12
# An extra 4 bytes are read as part of establishing the data type
# FIXME:ENUM Need to be sure that some other base type in the future
# wouldn't silently need more bytes and screw this all up. Should
# probably put some check/error handling around this.
# now get the keys
version = (datatype_msg['class_and_version'] >> 4) & 0x0F
for _ in range(num_members):
null_location = self.buf.index(b'\x00', self.offset)
name_size = null_location - self.offset + 1 if version == 3 else _padded_size(null_location - self.offset+ 1, 8)
name = self.buf[self.offset:self.offset+name_size]
name = name.strip(b'\x00').decode('ascii')
self.offset += name_size
enum_keys.append(name)
# now get the values
values = np.frombuffer(self.buf[self.offset:], dtype=dtype, count=num_members)
enum_dict = dict(zip(enum_keys, values))
return 'ENUMERATION', dtype, enum_dict


# IV.A.2.d The Datatype Message

@@ -177,8 +210,16 @@ def _determine_dtype_vlen(datatype_msg):
('class_bit_field_2', 'B'),
('size', 'I'),
))

DATATYPE_MSG_SIZE = _structure_size(DATATYPE_MSG)

ENUM_DATATYPE_MSG = OrderedDict((
('class_and_version', 'B'),
('number_of_members', 'H'), # 'H' is a 16-bit unsigned integer
('unused', 'B'),
('size', 'I'),
))
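The fixed-size header described by ``ENUM_DATATYPE_MSG`` can be unpacked with the standard ``struct`` module; a sketch assuming a little-endian packed layout (the byte order actually applied depends on pyfive's ``_unpack_struct_from`` helper, and the header bytes below are invented for illustration):

```python
import struct

# B (class_and_version), H (number_of_members), B (unused), I (size);
# little-endian with no padding gives 8 bytes in total.
fmt = '<BHBI'

# Hypothetical header: version 3 in the upper nibble, enumerated class (8)
# in the lower nibble, 6 members, 1-byte base values.
header = struct.pack(fmt, 0x38, 6, 0, 1)
class_and_version, number_of_members, unused, size = struct.unpack(fmt, header)
version = (class_and_version >> 4) & 0x0F
```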


COMPOUND_PROP_DESC_V1 = OrderedDict((
('offset', 'I'),
68 changes: 35 additions & 33 deletions pyfive/h5d.py
@@ -85,8 +85,10 @@ def __init__(self, dataobject, pseudo_chunking_size_MB=4):
self._unique = (self._filename, self.shape, self._msg_offset)

if isinstance(dataobject.dtype,tuple):
# this may not behave the same as h5py, do we care? #FIXME
self._dtype = dataobject.dtype
if dataobject.dtype[0] == 'ENUMERATION':
self._dtype = np.dtype(dataobject.dtype[1], metadata={'enum':dataobject.dtype[2]})
else:
self._dtype = dataobject.dtype
else:
self._dtype = np.dtype(dataobject.dtype)

@@ -312,34 +314,7 @@ def _build_index(self, dataobject):

def _get_contiguous_data(self, args):

if not isinstance(self._dtype, tuple):
if not self.posix:
# Not posix
return self._get_direct_from_contiguous(args)
else:
# posix
try:
# Create a memory-map to the stored array, which
# means that we will end up only copying the
# sub-array into in memory.
fh = self._fh
view = np.memmap(
fh,
dtype=self._dtype,
mode='c',
offset=self.data_offset,
shape=self.shape,
order=self._order
)
# Create the sub-array
result = view[args]
# Copy the data from disk to physical memory
result = result.view(type=np.ndarray)
fh.close()
return result
except UnsupportedOperation:
return self._get_direct_from_contiguous(args)
else:
if isinstance(self._dtype, tuple):
dtype_class = self._dtype[0]
if dtype_class == 'REFERENCE':
size = self._dtype[1]
@@ -371,6 +346,33 @@ def _get_contiguous_data(self, args):
else:
raise NotImplementedError(f'datatype not implemented - {dtype_class}')

if not self.posix:
# Not posix
return self._get_direct_from_contiguous(args)
else:
# posix
try:
# Create a memory-map to the stored array, which
# means that we will end up only copying the
# sub-array into in memory.
fh = self._fh
view = np.memmap(
fh,
dtype=self._dtype,
mode='c',
offset=self.data_offset,
shape=self.shape,
order=self._order
)
# Create the sub-array
result = view[args]
# Copy the data from disk to physical memory
result = result.view(type=np.ndarray)
fh.close()
return result
except UnsupportedOperation:
return self._get_direct_from_contiguous(args)


def _get_direct_from_contiguous(self, args=None):
"""
@@ -569,9 +571,9 @@ def _fh(self):

@property
def dtype(self):
if isinstance(self._dtype, tuple) and self._dtype[0] == 'VLEN_STRING':
return np.dtype("O")

if isinstance(self._dtype, tuple):
if self._dtype[0] == 'VLEN_STRING':
return np.dtype("O")
return self._dtype

