> [!CAUTION]
> This package does not have a stable API. However, we do not anticipate changing the on-disk format in an incompatible manner (since it is normal anndata).
A data loader + I/O utilities for minibatching on-disk AnnData, co-developed by Lamin and scverse.
Please refer to the documentation, in particular, the API documentation.
You need to have Python 3.11 or newer installed on your system. If you don't have Python installed, we recommend installing uv.
To install the latest release of `annbatch` from PyPI:

```bash
pip install annbatch
```
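If you manage your environment with uv (recommended above), a minimal setup sketch looks like the following; the Python version is an example, any 3.11+ interpreter works:

```bash
# Install a recent Python, create a virtual environment, and install annbatch.
uv python install 3.12
uv venv --python 3.12
uv pip install annbatch
```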
We provide extras in the `pyproject.toml` for `torch`, `cupy-cuda12`, `cupy-cuda13`, and `zarrs-python`. `cupy` provides accelerated handling of the data via `preload_to_gpu` once it has been read off disk and does not need to be used in conjunction with `torch`.
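For example, to install with optional dependencies (the extra names below are assumed to mirror the package names listed above; check `pyproject.toml` for the exact spelling):

```bash
# Assumed extra names - verify against pyproject.toml.
pip install "annbatch[torch,zarrs-python]"
pip install "annbatch[cupy-cuda12]"  # pick the CUDA major version matching your driver
```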
> [!IMPORTANT]
> `zarrs-python` provides the performance boost needed for the sharded data produced by our preprocessing functions to load quickly off a local filesystem.
Basic preprocessing:
```python
import zarr
import zarrs  # noqa: F401

from annbatch import create_anndata_collection

# Using zarrs is necessary for local filesystem performance.
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

create_anndata_collection(
    adata_paths=[
        "path/to/your/file1.h5ad",
        "path/to/your/file2.h5ad",
    ],
    output_path="path/to/output/collection",  # a directory containing `dataset_{i}.zarr`
    shuffle=True,  # shuffling is needed if you want to use chunked access
)
```
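To sanity-check the result, one of the produced shards can be opened directly with anndata. This is a minimal sketch; the `dataset_*.zarr` glob assumes the naming convention noted in the comment above:

```python
from pathlib import Path

import anndata as ad

# Open the first shard of the collection and inspect its shape and fields.
first_shard = next(Path("path/to/output/collection").glob("dataset_*.zarr"))
print(ad.read_zarr(first_shard))  # loads this shard fully into memory
```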
Data loading:
```python
from pathlib import Path

import anndata as ad
import zarr
import zarrs  # noqa: F401

from annbatch import ZarrSparseDataset

# Using zarrs is necessary for local filesystem performance.
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

ds = ZarrSparseDataset(
    batch_size=4096,
    chunk_size=32,
    preload_nchunks=256,
).add_anndatas(
    [
        ad.AnnData(
            # note that you can open an anndata file using any type of zarr store
            X=ad.io.sparse_dataset(zarr.open(p)["X"]),
            obs=ad.io.read_elem(zarr.open(p)["obs"]),
        )
        for p in Path("path/to/output/collection").glob("*.zarr")
    ],
    obs_keys="label_column",
)

# Iterate over the dataset (drop-in replacement for torch.utils.data.DataLoader)
for batch in ds:
    ...
```
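For orientation, here is a consumption sketch. It assumes that, with `obs_keys` set, each batch unpacks into an expression chunk plus the requested `label_column` values; verify the actual batch structure against the annbatch documentation before relying on it, and note that `train_step` is a hypothetical placeholder for your own training function:

```python
for X, label in ds:  # assumed (data, label) structure when obs_keys is set
    # X: one batch of rows drawn across the collection; label: matching obs values.
    train_step(X, label)  # hypothetical training function
```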
For a deeper dive into this example, please see the in-depth section of our docs.
See the changelog.
For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.
t.b.a