Database sources where each array element is a separate database row #438

Open
@ryan-williams

My impression is that the existing DB backends for Zarr use small DB tables of chunks: each row has a string key (that would otherwise be a filesystem path) and a binary blob (that would otherwise be the compressed chunk file at that path).

I wanted to flag a need I keep seeing in single-cell work, and have discussed with various folks (incl. @alasla, @mckinsel, @tomwhite, @laserson): putting lots of gene-expression matrices in a database (instead of storing each one in an HDF5 file, CSV, or Zarr directory), where each entry in these 2-D sparse matrices is stored as a database row (likely a (cell ID, gene ID, count) triple).

Generalizing, an N-D Zarr dataset can have each entry mapped to a DB row with N integer "key" columns and one "value" column (storing elements of the given Zarr dtype).
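To make the layout concrete, here is a minimal sketch using SQLite from the Python standard library. The table name `expression`, the column names `d0`, `d1`, ..., `value`, and the helper `create_entry_table` are all made up for illustration; nothing here is an existing Zarr API.

```python
import sqlite3

def create_entry_table(conn, name, ndim, value_type="INTEGER"):
    """Create a table with `ndim` integer key columns and one value column.

    `name` and `value_type` are interpolated into SQL, so they must be trusted.
    """
    dims = [f"d{i}" for i in range(ndim)]
    cols = ", ".join(f"{d} INTEGER NOT NULL" for d in dims)
    # The composite primary key doubles as the index over the key columns.
    conn.execute(
        f"CREATE TABLE {name} ({cols}, value {value_type} NOT NULL, "
        f"PRIMARY KEY ({', '.join(dims)}))"
    )
    return dims

conn = sqlite3.connect(":memory:")
dims = create_entry_table(conn, "expression", ndim=2)

# Only the nonzero entries of the sparse matrix become rows:
# (cell index, gene index, count).
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [(0, 1, 5), (0, 3, 2), (2, 0, 7), (3, 3, 1)],
)
rows = conn.execute("SELECT * FROM expression ORDER BY d0, d1").fetchall()
```

A real backend would presumably map cell and gene IDs to integer indices in side tables; that part is left out here.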

You can straightforwardly support existing Zarr access patterns by indexing such a table on the "key" columns and letting Zarr page full chunks into memory to operate on, as usual. Fetching a chunk from such a table is a simple range query against that index, with each dimension column bounded by consecutive multiples of the chunk size along that dimension, and downstream code need not care that the chunk it is fed is entirely virtual.
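A chunk fetch along these lines might look roughly like the sketch below (2-D case, SQLite, standard library only). `fetch_chunk`, the table and column names, and the fill value of 0 for missing entries are assumptions for illustration, not part of Zarr.

```python
import sqlite3

def fetch_chunk(conn, table, chunk_index, chunk_shape, fill_value=0):
    """Materialize one 2-D chunk as a dense list of lists from entry rows."""
    (ci, cj), (h, w) = chunk_index, chunk_shape
    r0, c0 = ci * h, cj * w  # chunk origin: multiples of the chunk size
    chunk = [[fill_value] * w for _ in range(h)]
    # Half-open range bounds on each key column select exactly one chunk.
    query = (
        f"SELECT d0, d1, value FROM {table} "
        "WHERE d0 >= ? AND d0 < ? AND d1 >= ? AND d1 < ?"
    )
    for r, c, v in conn.execute(query, (r0, r0 + h, c0, c0 + w)):
        chunk[r - r0][c - c0] = v
    return chunk

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression "
    "(d0 INTEGER, d1 INTEGER, value INTEGER, PRIMARY KEY (d0, d1))"
)
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [(0, 1, 5), (0, 3, 2), (2, 0, 7), (3, 3, 1)],
)

# Chunk (0, 1) of a 4x4 matrix with 2x2 chunks covers rows 0-1, columns 2-3.
top_right = fetch_chunk(conn, "expression", (0, 1), (2, 2))
bottom_right = fetch_chunk(conn, "expression", (1, 1), (2, 2))
```

The caller only ever sees a dense chunk, so nothing downstream has to know the chunk was assembled from rows.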

This model can also trivially simulate concatenating, splitting, and re-chunking Zarr trees, potentially obviating a host of related problems (e.g. #297, #323, #392). It also raises a lot of questions, e.g.: when should you ever store things in a filesystem instead of a database (possibly never 😝)? How core are filesystem assumptions to the essence of Zarr (not very, IMO, though we haven't really hashed this out)?
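As a rough illustration of the concatenation and re-chunking point, here is a sketch in SQLite (standard library only). The tables `a` and `b`, the view `stacked`, and the helper `read_region` are hypothetical names; the idea is only that concatenation becomes an offset on a key column and re-chunking becomes a different choice of query bounds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("a", "b"):
    conn.execute(
        f"CREATE TABLE {name} "
        "(d0 INTEGER, d1 INTEGER, value INTEGER, PRIMARY KEY (d0, d1))"
    )
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", [(0, 0, 1), (1, 2, 4)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)", [(0, 1, 9), (1, 0, 3)])

A_ROWS = 2  # assumed shape of `a` is (2, 3), so `b` starts at row 2

# Concatenation along axis 0: shift b's row index, copy nothing.
conn.execute(
    "CREATE VIEW stacked AS "
    "SELECT d0, d1, value FROM a "
    f"UNION ALL SELECT d0 + {A_ROWS}, d1, value FROM b"
)

def read_region(conn, table, start, stop):
    """Return the (d0, d1, value) rows inside the half-open box [start, stop)."""
    query = (
        f"SELECT d0, d1, value FROM {table} "
        "WHERE d0 >= ? AND d0 < ? AND d1 >= ? AND d1 < ? ORDER BY d0, d1"
    )
    return conn.execute(query, (start[0], stop[0], start[1], stop[1])).fetchall()

everything = read_region(conn, "stacked", (0, 0), (4, 3))
# "Re-chunking" is the same rows read with different bounds.
chunk_2x3 = read_region(conn, "stacked", (2, 0), (4, 3))
chunk_1x3 = read_region(conn, "stacked", (3, 0), (4, 3))
```

Whether a view like this stays fast at scale depends on the database and its indexes; this only shows that the logical operation needs no data movement.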

In any case, I am eager to make a Zarr backend for "entry"-level DBs like this, and will post any progress here. Any thoughts are welcome!

Labels: enhancement
