GitHub - CyrilJl/BatchStats: Python package for efficient, online statistical computations on streaming or large-scale data

batchstats is a Python package for computing statistics on data that arrives in batches. It's perfect for streaming data or datasets too large to fit into memory.

For detailed information, please check out the full documentation.

Installation

Install batchstats using pip:

pip install batchstats

Or with conda:

conda install -c conda-forge batchstats

Quick Start

Here's how to compute the mean and variance of a dataset in batches:

import numpy as np
from batchstats import BatchMean, BatchVar

# Simulate a data stream
data_stream = (np.random.randn(100, 10) for _ in range(10))

# Initialize the stat objects
batch_mean = BatchMean()
batch_var = BatchVar()

# Process each batch
for batch in data_stream:
    batch_mean.update_batch(batch)
    batch_var.update_batch(batch)

# Get the final result
mean = batch_mean()
variance = batch_var()

print(f"Mean shape: {mean.shape}")
print(f"Variance shape: {variance.shape}")

Advanced Usage

batchstats handles n-dimensional np.ndarray inputs and allows specifying multiple axes for reduction, just like numpy.

import numpy as np
from batchstats import BatchMean

# Create a 3D data stream
data_stream = (np.random.rand(10, 5, 8) for _ in range(5))

# Compute the mean over the last two axes (1 and 2)
batch_mean_3d = BatchMean(axis=(1, 2))

for batch in data_stream:
    batch_mean_3d.update_batch(batch)

mean_3d = batch_mean_3d()

print(f"3D Mean shape: {mean_3d.shape}")

Handling NaN Values

batchstats provides BatchNan* classes to handle NaN values, similar to numpy's nan* functions.

import numpy as np
from batchstats import BatchNanMean

# Create data with NaNs
data = np.random.randn(1000, 5)
data[::10] = np.nan

# Compute the mean, ignoring NaNs
nan_mean = BatchNanMean().update_batch(data)()

print(f"NaN-aware mean shape: {nan_mean.shape}")

Available Statistics

batchstats supports a variety of common statistics:

BatchSum / BatchNanSum
BatchMean / BatchNanMean
BatchMin / BatchNanMin
BatchMax / BatchNanMax
BatchPeakToPeak / BatchNanPeakToPeak
BatchVar
BatchStd
BatchCov
BatchCorr

For more details on each class, see the API Reference.

Name		Name	Last commit message	Last commit date
Latest commit History 157 Commits
.github/workflows		.github/workflows
batchstats		batchstats
docs		docs
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
readthedocs.yml		readthedocs.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Installation

Quick Start

Advanced Usage

Handling NaN Values

Available Statistics

About

Uh oh!

Releases 12

Packages

Uh oh!

Contributors 3

Uh oh!

Languages

License

CyrilJl/BatchStats

Folders and files

Latest commit

History

Repository files navigation

Installation

Quick Start

Advanced Usage

Handling NaN Values

Available Statistics

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 12

Packages 0

Uh oh!

Contributors 3

Uh oh!

Languages

Packages