This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Don't read entire chunks at a time #57

Closed
@JackKelly

Description

Because we're reading entire chunks at a time, we're reading and decompressing a lot more data than we need, which is almost certainly a major bottleneck.

Some ideas:

  • Try again to use zarr.core.Array(partial_decompress=True) with FSStore and Blosc compression (we're already using Blosc zstd level 5 for NWPs) to read small slices of each chunk, especially for NWPs, where we only need tiny 2x2 images. (See the sketch after this list.)
  • Try uncompressed Zarr. It's possible that the performance gain from compression is far smaller than the gain from being able to extract precisely the data we want from each chunk.
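
As a rough sketch of the first idea (not the repo's actual code): open the array through FSStore with partial_decompress=True so that, for Blosc-compressed chunks, only the blocks covering the requested selection are decompressed. The store path, array name, and indices below are made up for illustration.

```python
import zarr
from zarr.storage import FSStore

# Hypothetical store path and array name, for illustration only.
store = FSStore("gs://bucket/nwp.zarr")

# partial_decompress=True asks zarr to decompress only the Blosc blocks
# needed for the requested selection, rather than whole chunks.
nwp = zarr.core.Array(store, path="UKV", read_only=True, partial_decompress=True)

# Reading a tiny 2x2 spatial window should now touch far less data
# than decompressing the full chunk it lives in.
example = nwp[0, :, 100:102, 200:202]
```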

If it's possible to quickly load subsets of each chunk, then modify each Zarr DataSource so it no longer pre-reads entire chunks into memory, but instead, for each batch, loads each example separately using a different thread.
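
A minimal sketch of that change, assuming a hypothetical list of per-example (t, y, x) locations and using a thread pool to load each example concurrently (names here are illustrative, not the real DataSource API):

```python
from concurrent.futures import ThreadPoolExecutor

def load_batch(zarr_array, example_locations, size=2):
    """Load each example in a batch with its own thread, instead of
    pre-reading whole chunks. `example_locations` is a hypothetical list
    of (t, y, x) indices, one per example."""
    def load_example(loc):
        t, y, x = loc
        # Read just the small window this example needs (e.g. 2x2 for NWPs).
        return zarr_array[t, :, y:y + size, x:x + size]

    with ThreadPoolExecutor() as executor:
        return list(executor.map(load_example, example_locations))
```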
