Description
My impression is that the existing DB backends for Zarr use small DB tables of chunks: each row has a string key (that would otherwise be a filesystem path) and a binary blob (that would otherwise be the compressed chunk file at that path).
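For concreteness, here's a minimal sketch of that chunk-level model as a SQLite-backed `MutableMapping` (the kind of mapping Zarr can use as a store). The table and class names are mine, purely illustrative, and not taken from any existing backend:

```python
import sqlite3
from collections.abc import MutableMapping

class ChunkLevelStore(MutableMapping):
    """Minimal key/blob store: one row per chunk, keyed by the path-like
    string Zarr would otherwise use as a filename."""

    def __init__(self, path=":memory:"):
        # Autocommit, purely for brevity in this sketch.
        self.db = sqlite3.connect(path, isolation_level=None)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS chunks (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __getitem__(self, key):
        row = self.db.execute(
            "SELECT value FROM chunks WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        self.db.execute(
            "REPLACE INTO chunks (key, value) VALUES (?, ?)",
            (key, sqlite3.Binary(value)),
        )

    def __delitem__(self, key):
        self.db.execute("DELETE FROM chunks WHERE key = ?", (key,))

    def __iter__(self):
        return (k for (k,) in self.db.execute("SELECT key FROM chunks"))

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```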
I wanted to flag a need I keep seeing in single-cell analysis, and have discussed with various folks (incl. @alasla, @mckinsel, @tomwhite, @laserson): putting lots of gene-expression matrices in a database (instead of storing each one in an HDF5 file, CSV, or Zarr directory), where each entry in these 2-D sparse matrices is stored as a database row (likely a (cell ID, gene ID, count) triple).
Generalizing, an N-D Zarr dataset can have each entry mapped to a DB row with N integer "key" columns and one "value" column (storing elements of the given Zarr dtype).
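As a sketch (table and column names are just illustrative), the 2-D single-cell case and the general N-D case might look like:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# 2-D single-cell case: one row per nonzero entry of the expression matrix.
db.execute("""
    CREATE TABLE expression (
        cell_id INTEGER NOT NULL,
        gene_id INTEGER NOT NULL,
        count   INTEGER NOT NULL,
        PRIMARY KEY (cell_id, gene_id)  -- doubles as the index on the "key" columns
    )
""")

# General N-D case (here N = 3): N integer "key" columns plus one "value"
# column whose SQL type corresponds to the Zarr dtype.
db.execute("""
    CREATE TABLE entries (
        i0 INTEGER NOT NULL,
        i1 INTEGER NOT NULL,
        i2 INTEGER NOT NULL,
        value REAL NOT NULL,
        PRIMARY KEY (i0, i1, i2)
    )
""")
```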
You can straightforwardly support existing Zarr access patterns by indexing such a table on the "key" columns, and letting Zarr page full chunks into memory to operate on, as usual. Fetching a chunk from such a DB table is a simple DB query against an index (with appropriate chunk-size-multiple bounds on each dimension column), and downstream code need not care that the chunk it is fed is entirely virtual.
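Roughly, and again only as a sketch against the illustrative `entries(i0, i1, i2, value)` table above, paging one chunk into memory is a single query with a range predicate per dimension, with missing rows filled by the array's fill value:

```python
import numpy as np

def fetch_chunk(db, chunk_index, chunk_shape, dtype=np.float64, fill_value=0):
    """Materialize one chunk of an N-D array stored one-entry-per-row.

    `chunk_index` is the chunk's coordinate in the chunk grid, so the slab
    covered is [c * size, (c + 1) * size) along each dimension.
    """
    ndim = len(chunk_shape)
    starts = [c * s for c, s in zip(chunk_index, chunk_shape)]
    stops = [start + s for start, s in zip(starts, chunk_shape)]

    # One pair of bounds per "key" column; the composite primary key serves
    # as the index that makes this a cheap range scan.
    cols = [f"i{d}" for d in range(ndim)]
    where = " AND ".join(f"{c} >= ? AND {c} < ?" for c in cols)
    params = [b for bounds in zip(starts, stops) for b in bounds]

    chunk = np.full(chunk_shape, fill_value, dtype=dtype)
    query = f"SELECT {', '.join(cols)}, value FROM entries WHERE {where}"
    for *idx, value in db.execute(query, params):
        chunk[tuple(i - o for i, o in zip(idx, starts))] = value
    return chunk
```

A store wrapping this would hand Zarr the (compressed) bytes of `fetch_chunk(...)` whenever a chunk key is requested, and decompose chunks back into rows on write.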
This model can also trivially simulate concatenation, splitting, and re-chunking of Zarr trees, potentially obviating a host of related problems (e.g. #297, #323, #392). More generally, it raises questions about when you should ever store things in a filesystem instead of a database (possibly never 😝), how central filesystem assumptions are to the essence of Zarr (not very, IMO, though we haven't really hashed this out), etc.
In any case, I am eager to make a Zarr backend for "entry"-level DBs like this, and will post any progress here. Any thoughts are welcome!