Commit `023d9e4` — Apollo3zehn committed Jun 24, 2024 (1 parent: `0fb3e31`)

Fix #115: Documentation: How to use the chunk cache; add a "performance" section

Showing 3 changed files with 55 additions and 1 deletion.

# Performance

The HDF5 file format is a good choice for storing compressed data in chunks, which can actually improve speed by reducing the amount of data that needs to be read from disk. However, chunked data can only be accessed as a whole (due to compression), so it is important to choose an appropriate chunk size.

For example, consider a two-dimensional dataset with dimensions of `10,000 x 100`. This dataset will be filled with real-time sampled data where the first dimension is the time axis, i.e. there will be `100` values per sample. You could now choose a chunk size of `1 x 100`, which means that a chunk is the size of a single sample. This chunk size works well for write operations.

But when the measurement is finished, you want to read the data back into memory. Often the access pattern is now different: instead of writing the data row by row, you want to read it column by column. This is common with measurement systems that have tens or hundreds of individual channels. In our example, we want to read the first column - the first channel - with dimensions `10,000 x 1`.

To do this, PureHDF has to open all `10,000` chunks, decompress them, extract the first value of each, and collect the values in the final array. This severely degrades performance. If the chunk size had been set to `10,000 x 1` instead, it would have been a single read operation and much faster overall. At the same time, however, the write performance would be greatly reduced, because now `100` individual chunks would have to be accessed per sample, where *access* means `decompress -> append new value -> compress`.

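To make the trade-off concrete, the chunk layout is fixed when the dataset is created. The following sketch uses the `chunks:` parameter of the PureHDF `H5Dataset` constructor; the exact parameter name and element type should be verified against the current PureHDF release, so treat this as an illustration rather than a definitive API reference:

```cs
using PureHDF;

// 10,000 samples (time axis) x 100 channels of measurement data
var data = new float[10_000, 100];

var file = new H5File
{
    // chunks of 1 x 100: one chunk per sample - fast to write row by row,
    // but slow to read a single channel column by column.
    // chunks of 10,000 x 1 would favor column-wise reads instead.
    ["measurement"] = new H5Dataset(data, chunks: new uint[] { 1, 100 })
};

file.Write("measurement.h5");
```
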
The best solution to this problem is to use chunk caches. A chunk cache holds the decompressed data, so in the case of a write operation the pattern `decompress -> append new value -> compress` changes to `find chunk in cache -> append new value`, which is generally much faster.

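Conceptually, a chunk cache is a map from a chunk's index to its decompressed data. The following library-independent sketch illustrates the `find chunk in cache -> append new value` pattern described above; the types and the `DecompressFromDisk` helper are hypothetical stand-ins, not part of PureHDF:

```cs
using System.Collections.Generic;

// Hypothetical sketch: map a chunk index to its decompressed bytes so that
// repeated accesses hit memory instead of the decompress/compress cycle.
var cache = new Dictionary<ulong, byte[]>();

// Stand-in for the expensive real work (read from disk + decompress).
byte[] DecompressFromDisk(ulong index) => new byte[1024];

byte[] GetChunk(ulong index)
{
    // "find chunk in cache": only decompress on a cache miss
    if (!cache.TryGetValue(index, out var chunk))
    {
        chunk = DecompressFromDisk(index);
        cache[index] = chunk;
    }

    // "append new value" then happens in memory on the returned buffer
    return chunk;
}
```
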
Performance will be degraded again if the chunk cache is too small. The default *reading* chunk cache properties are:

- `521` chunk entries
- with a maximum total size of `1 MB`

## Reading

The default implementation of the `IReadingChunkCache` interface is the `SimpleReadingChunkCache`. You can change the parameters of this cache or replace it entirely with your own implementation as follows:

```cs
var dataset = (NativeDataset)file.Dataset("/the/dataset");

var datasetAccess = new H5DatasetAccess(
    ChunkCache: new SimpleReadingChunkCache(
        chunkSlotCount: 521,
        byteCount: 1 * 1024 * 1024
    )
);

dataset.Read<T>(datasetAccess, ...);
```

## Writing

The default implementation of the `IWritingChunkCache` interface is the `SimpleWritingChunkCache`. It has no chunk count or chunk size limits, because a current PureHDF limitation is that chunks can only be written once. Therefore the chunk cache **must** hold all data in memory until all other file structures have been written.

If you want to use your own implementation of `IWritingChunkCache`, you can provide it in the `H5Dataset` constructor like this:

```cs
var datasetCreation = new H5DatasetCreation(
    ChunkCache: <your own chunk cache>
);

var dataset = new H5Dataset(..., datasetCreation: datasetCreation);
```

> [!NOTE]
> Alternatively, you can provide the chunk caches in a central place: `ChunkCache.DefaultReadingChunkCacheFactory` for reading, or `ChunkCache.DefaultWritingChunkCacheFactory` for writing.
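
For example, assuming the factory properties are assignable delegates that produce a cache instance (check the `ChunkCache` static class for the exact signatures), a larger default reading cache could be configured once at application startup:

```cs
// Assumption: the factory is a delegate returning an IReadingChunkCache.
// Here the default 1 MB budget is raised to 16 MB for all reads.
ChunkCache.DefaultReadingChunkCacheFactory = () => new SimpleReadingChunkCache(
    chunkSlotCount: 521,
    byteCount: 16 * 1024 * 1024
);
```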
```
@@ -10,5 +10,8 @@
- name: Filters
  href: filters.md
- name: Performance
  href: performance.md
- name: API
  href: api/
```