Description
Hello!
I want to propose adding a new Zarr store type for the case when all array chunks are located in a single binary file. A prototype implementation, named file chunk store, is described in this Medium post. In this approach, Zarr metadata (.zgroup, .zarray, .zattrs, or .zmetadata) is stored in one of the current Zarr store types while the array chunks are in a binary file. The file chunk store translates array chunk keys into file seek and read operations and therefore only provides read access to the chunk data.
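To make the key-to-read translation concrete, here is a minimal sketch, not the prototype's actual code. The class name `ReadOnlyChunkReader` and the `{key: (offset, size)}` layout are assumptions made for illustration. Building on `collections.abc.Mapping` rather than `MutableMapping` means there is no item assignment, which matches the read-only behavior described above.

```python
import io
from collections.abc import Mapping


class ReadOnlyChunkReader(Mapping):
    """Hypothetical read-only mapping from chunk keys to chunk bytes."""

    def __init__(self, fileobj, locations):
        # fileobj: any seekable binary file-like object
        # locations: {chunk_key: (offset, size)} (assumed layout)
        self._file = fileobj
        self._locations = dict(locations)

    def __getitem__(self, key):
        offset, size = self._locations[key]  # KeyError for unknown keys
        self._file.seek(offset)
        data = self._file.read(size)
        if len(data) != size:
            raise OSError(f"short read for chunk {key!r}")
        return data

    def __iter__(self):
        return iter(self._locations)

    def __len__(self):
        return len(self._locations)


# Usage with an in-memory stand-in for the binary file
blob = io.BytesIO(b"HEADERchunk0chunk1")
reader = ReadOnlyChunkReader(blob, {"a/0": (6, 6), "a/1": (12, 6)})
chunk = reader["a/1"]
```

Because the object only needs `seek` and `read`, the same sketch would work with a local file or any other seekable binary file-like object.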
The file chunk store requires a mapping between array chunk keys and their file locations. The prototype implementation puts this information for every Zarr array in JSON files named .zchunkstore. An example is below:
{
    "BEAM0001/tx_pulseflag/0": {
        "offset": 94854560,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/1": {
        "offset": 94854680,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/2": {
        "offset": 94854800,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/3": {
        "offset": 94854920,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/4": {
        "offset": 96634038,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/5": {
        "offset": 96634158,
        "size": 123
    },
    "source": {
        "array_name": "/BEAM0001/tx_pulseflag",
        "uri": "https://e4ftl01.cr.usgs.gov/GEDI/GEDI01_B.001/2019.05.26/GEDI01_B_2019146164739_O02560_T04067_02_003_01.h5"
    }
}
An array chunk's file location is described by its starting byte (offset) in the file and the number of bytes to read (size). Also included is the file information (source) to enable verification of chunk data provenance. The file chunk store prototype uses file-like Python objects, delegating to users the responsibility to arrange access to the correct files.
We can discuss specific implementation details if there is enough interest in this new store type.
Thanks!