Merged
8 changes: 4 additions & 4 deletions doc/optimising.rst
@@ -6,7 +6,7 @@ HDF5 files can be large and complicated, with complex internal structures which
These complexities (and the overheads they introduce) can be mitigated by optimising how you access the data, but this requires an understanding of
how the data is stored in the file and how the data access library (in this case ``pyfive``) works.

-The data storage complexities arise from two main factors: the use of chunking, and the way attributes are stored in the files
+The data storage complexities arise from two main factors: the use of chunking, and the way attributes are stored in the files.

**Chunking**: HDF5 files can store data in chunks, which allows for more efficient access to large datasets.
However, this also means that the library needs to maintain an index (a "b-tree") which relates the position in coordinate space to where each chunk is stored in the file.
@@ -20,7 +20,7 @@ Optimising the files themselves
-------------------------------

Optimal access to data occurs when the data is chunked in a way that matches the access patterns of your application, and when the
-b-tree indexes and attributess are stored contiguously in the file.
+b-tree indexes and attributes are stored contiguously in the file.

Users of ``pyfive`` will always confront data files which have been created by other software, but if possible, it is worth exploring whether
the `h5repack <https://docs.h5py.org/en/stable/special.html#h5repack>`_ tool can
@@ -41,7 +41,7 @@ For example, instead of doing:
import pyfive

with pyfive.File("data.h5", "r") as f:
-        variables = [f for var in f]
+        variables = [var for var in f]
print("Variables in file:", variables)
temp = variables['temp']

@@ -85,7 +85,7 @@ For example, you can use the `concurrent.futures` module to read data from multi
print("Results:", results)


-You can do the same thing to parallelise manipuations within the variables, by for example using, ``Dask``, but that is beyond the scope of this document.
+You can do the same thing to parallelise manipulations within the variables, by for example using, ``Dask``, but that is beyond the scope of this document.


Using pyfive with S3
2 changes: 1 addition & 1 deletion pyfive/h5d.py
@@ -241,7 +241,7 @@ def index(self):
return self._index
#### The following method can be used to set pseudo chunking size after the
#### file has been closed and before data transactions. This is pyfive specific
-    def set_psuedo_chunk_size(self, newsize_MB):
+    def set_pseudo_chunk_size(self, newsize_MB):
Collaborator Author:
@bnlawrence was well thorough - he wrote psuedo in the test call too 🤣

Collaborator:
At least I got the test name right ... I think ... :-)

""" Set pseudo chunking size for contiguous variables.
This is a ``pyfive`` API extension.
The default value is 4 MB which should be suitable for most applications.
2 changes: 1 addition & 1 deletion tests/test_pseudochunking.py
@@ -21,7 +21,7 @@ def setup_data():
with pyfive.File(file_like,'r') as f:
var1 = f['var1']
# use 100 KB as the chunk size
-        var1.id.set_psuedo_chunk_size(0.1)
+        var1.id.set_pseudo_chunk_size(0.1)
Collaborator Author:
@bnlawrence no, they were both wrong, that's why the test was passing 🤣

Collaborator Author:
ah, the test name -> yes, sorry, my bad 😁


return var1, data
