Add safetensors reader #555

Open · wants to merge 5 commits into base: develop
3 changes: 3 additions & 0 deletions .gitignore
@@ -190,3 +190,6 @@ venv

# Pixi lock file (because it changes with every upstream commit)
pixi.lock

# AI-assisted code development: https://github.com/ezyang/codemcp
codemcp*
4 changes: 4 additions & 0 deletions docs/faq.md
@@ -92,6 +92,10 @@ You can also use this approach to write a reader that starts from a kerchunk-for

Currently if you want to call your new reader from `virtualizarr.open_virtual_dataset` you would need to open a PR to this repository, but we plan to generalize this system to allow 3rd party libraries to plug in via an entrypoint (see [issue #245](https://github.com/zarr-developers/VirtualiZarr/issues/245)).

### What ML/AI model formats are supported?

VirtualiZarr has built-in support for [SafeTensors](safetensors.md) files, which are commonly used for storing ML model weights in a safe, efficient format.

## How does this actually work?

I'm glad you asked! We can think of the problem of providing virtualized zarr-like access to a set of archival files in some other format as a series of steps:
3 changes: 2 additions & 1 deletion docs/index.md
@@ -15,7 +15,7 @@ VirtualiZarr aims to make the creation of cloud-optimized virtualized zarr data
## Features

* Create virtual references pointing to bytes inside an archival file with [`open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/latest/usage.html#opening-files-as-virtual-datasets),
- * Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4 and HDF5,
+ * Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4, HDF5, and [SafeTensors](safetensors.md),
* [Combine data from multiple files](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets) into one larger store using [xarray's combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html), such as [`xarray.concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html),
* Commit the virtual references to storage either using the [Kerchunk references](https://fsspec.github.io/kerchunk/spec.html) specification or the [Icechunk](https://icechunk.io/) transactional storage engine.
* Users access the virtual dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset).
@@ -79,6 +79,7 @@ self
installation
usage
examples
safetensors
faq
api
releases
134 changes: 134 additions & 0 deletions docs/safetensors.md
@@ -0,0 +1,134 @@
# SafeTensors Reader User Guide

The SafeTensors reader in VirtualiZarr lets you create virtual references to tensors stored in SafeTensors files, without loading or copying the underlying data. This guide explains how to use the reader effectively.

## What is SafeTensors Format?

SafeTensors is a file format developed by HuggingFace for storing tensors (multidimensional arrays)
that offers several advantages:
- Safe: No use of pickle, eliminating security concerns
- Efficient: Zero-copy access for fast loading
- Simple: Straightforward binary format with JSON header
- Language-agnostic: Available across Python, Rust, C++, and JavaScript

The format consists of three sections, which the sketch below parses:
- 8 bytes (header size): little-endian uint64 containing the size of the header
- JSON header: Contains metadata for all tensors (shapes, dtypes, offsets)
- Binary data: Contiguous tensor data
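
To make the byte layout concrete, here is a minimal sketch of parsing the header by hand (illustrative only; `read_safetensors_header` is not part of VirtualiZarr's or the safetensors library's API):

```python
import json
import struct

def read_safetensors_header(path):
    """Parse the JSON header of a SafeTensors file (illustrative sketch)."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian uint64 giving the header size
        (header_size,) = struct.unpack("<Q", f.read(8))
        # Next `header_size` bytes: UTF-8 JSON mapping tensor names to
        # {"dtype": ..., "shape": [...], "data_offsets": [begin, end]}
        return json.loads(f.read(header_size))

header = read_safetensors_header("model.safetensors")
for name, info in header.items():
    if name != "__metadata__":  # optional file-level metadata entry
        print(name, info["dtype"], info["shape"], info["data_offsets"])
```

Note that `data_offsets` are relative to the start of the binary data section (byte `8 + header_size` of the file), which is what makes serving tensors via simple byte-range requests possible.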

## How VirtualiZarr's SafeTensors Reader Works

VirtualiZarr's SafeTensors reader allows you to:
- Create "virtual" Zarr stores pointing to chunks of data inside SafeTensors files
- Open the virtual zarr stores as xarray DataArrays with named dimensions
- Access specific slices of tensors from cloud storage
- Preserve metadata from the original SafeTensors file
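
Conceptually, the reader turns each tensor into a chunk-manifest entry, i.e. a (path, offset, length) triple keyed by chunk position, pointing at the tensor's bytes in place. A hypothetical single-chunk entry might look like this (field names follow VirtualiZarr's chunk-manifest convention; all values are illustrative):

```python
# Illustrative values, as if taken from a parsed SafeTensors header
header_size, (begin, end) = 1_024, (0, 4_096)

# How one 2-D tensor stored as a single chunk could map to a manifest entry
manifest_entry = {
    "0.0": {  # chunk key: chunk (0, 0) of the array
        "path": "s3://my-bucket/model.safetensors",
        "offset": 8 + header_size + begin,  # absolute byte offset in the file
        "length": end - begin,              # tensor size in bytes
    }
}
```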

## Basic Usage

Opening a SafeTensors file is straightforward:

```python
import virtualizarr as vz

# Open a SafeTensors file
vds = vz.open_virtual_dataset("model.safetensors")

# Access tensors as xarray variables
weight = vds["weight"]
bias = vds["bias"]
```
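
The resulting virtual dataset behaves like any other VirtualiZarr dataset, so you can persist its references for later use, for example as Kerchunk JSON (a short sketch; the output filename is arbitrary):

```python
# Serialize the virtual references so the model can be reopened lazily later
vds.virtualize.to_kerchunk("model_refs.json", format="json")
```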

## Custom Dimension Names

By default, dimensions are named generically (e.g., "weight_dim_0", "weight_dim_1"). You can provide custom dimension names for better semantics:

```python
# Define custom dimension names
custom_dims = {
    "weight": ["input_dims", "output_dims"],
    "bias": ["output_dims"],
}

# Open with custom dimension names
vds = vz.open_virtual_dataset(
    "model.safetensors",
    virtual_backend_kwargs={"dimension_names": custom_dims},
)

# Now dimensions have meaningful names
print(vds["weight"].dims) # ('input_dims', 'output_dims')
print(vds["bias"].dims) # ('output_dims',)
```

## Loading Specific Variables

You can specify which variables to load as eager arrays instead of virtual references:

```python
# Load specific variables as eager arrays
vds = vz.open_virtual_dataset(
    "model_weights.safetensors",
    loadable_variables=["small_tensor1", "small_tensor2"],
)

# These will be loaded as regular numpy arrays
small_tensor1 = vds["small_tensor1"]
# Large tensors remain virtual references
large_tensor = vds["large_tensor"]
```
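
If you do not know in advance which tensors are small, one possible pattern is to choose `loadable_variables` by byte size from the header itself (an illustrative helper, not part of VirtualiZarr's API; it relies on the header layout described above):

```python
import json
import struct

def small_tensors(path, max_bytes=1_000_000):
    """Names of tensors at most `max_bytes` large (illustrative helper)."""
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size))
    return [
        name
        for name, info in header.items()
        if name != "__metadata__"
        and info["data_offsets"][1] - info["data_offsets"][0] <= max_bytes
    ]

vds = vz.open_virtual_dataset(
    "model_weights.safetensors",
    loadable_variables=small_tensors("model_weights.safetensors"),
)
```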

## Working with Remote Files

The SafeTensors reader supports reading from the HuggingFace Hub:
```python
# HuggingFace Hub
vds = vz.open_virtual_dataset(
    "https://huggingface.co/openai-community/gpt2/model.safetensors",
    virtual_backend_kwargs={"revision": "main"},
)
```

It also supports reading from object storage:

```python
# S3
vds = vz.open_virtual_dataset(
    "s3://my-bucket/model.safetensors",
    reader_options={
        "storage_options": {
            "key": "ACCESS_KEY",
            "secret": "SECRET_KEY",
            "region_name": "us-west-2",
        }
    },
)
```
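
For public buckets no explicit credentials are needed; fsspec-style storage options also allow anonymous access (assuming the bucket permits it):

```python
# Public bucket: anonymous access instead of explicit credentials
vds = vz.open_virtual_dataset(
    "s3://public-bucket/model.safetensors",
    reader_options={"storage_options": {"anon": True}},
)
```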

## Accessing Metadata

SafeTensors files can contain metadata at the file level and tensor level:

```python
# Access file-level metadata
print(vds.attrs) # File-level metadata

# Access tensor-specific metadata
print(vds["weight"].attrs) # Tensor-specific metadata

# Access original SafeTensors dtype information
original_dtype = vds["weight"].attrs["original_safetensors_dtype"]
print(f"Original dtype: {original_dtype}")
```

## Known Limitations

### Performance Considerations
- Very large tensors (>1GB) are treated as a single chunk, which may impact memory usage when accessing small slices
- Files with thousands of tiny tensors may have overhead due to metadata handling

## Best Practices

- **For large tensors**: Use slicing to access only the portions you need, as sketched below
- **For remote files**: Pass credentials via `reader_options={"storage_options": ...}` and read from infrastructure close to the data to reduce latency
- **For many small tensors**: Consider loading them eagerly using `loadable_variables`
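
As a sketch of the first point: persist the references, then open them lazily with xarray so selections are only evaluated on demand (this uses the standard kerchunk `reference://` pattern; filenames are illustrative). Because each tensor is stored as a single chunk, reading any slice still fetches that tensor's whole chunk:

```python
import xarray as xr

# Persist the virtual references, then reopen them lazily via kerchunk/zarr
vds.virtualize.to_kerchunk("model_refs.json", format="json")
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "model_refs.json"},
    },
)
row = ds["weight"][0, :]  # lazy; bytes are fetched only on .values / .compute()
```
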
22 changes: 22 additions & 0 deletions docs/usage.md
@@ -89,6 +89,28 @@ aws_credentials = {"key": ..., "secret": ...}
vds = open_virtual_dataset("s3://some-bucket/file.nc", reader_options={'storage_options': aws_credentials})
```

### Opening different file formats

VirtualiZarr automatically detects the file format based on the file extension or content. Currently supported formats include:

- **NetCDF/HDF5**: Scientific data formats (NetCDF3, NetCDF4/HDF5)
- **DMRPP**: OPeNDAP DMR++ sidecar metadata files describing the layout of HDF5/NetCDF4 data
- **FITS**: Astronomical data in Flexible Image Transport System format
- **TIFF**: Tagged Image File Format for geospatial and scientific imagery
- **SafeTensors**: ML model weights format (`*.safetensors`), see the [SafeTensors guide](safetensors.md) for details
- **Kerchunk references**: Previously created virtualized references

Each format has specific readers optimized for its structure. For SafeTensors files, additional options like custom dimension naming are available:

```python
# Open a SafeTensors file with custom dimension names
custom_dims = {"weight": ["input_features", "output_features"]}
vds = open_virtual_dataset(
    "model.safetensors",
    virtual_backend_kwargs={"dimension_names": custom_dims},
)
```

## Chunk Manifests

In the Zarr model N-dimensional arrays are stored as a series of compressed chunks, each labelled by a chunk key which indicates its position in the array. Whilst conventionally each of these Zarr chunks are a separate compressed binary file stored within a Zarr Store, there is no reason why these chunks could not actually already exist as part of another file (e.g. a netCDF file), and be loaded by reading a specific byte range from this pre-existing file.
Expand Down
20 changes: 13 additions & 7 deletions pyproject.toml
@@ -52,6 +52,11 @@ hdf = [
"imagecodecs-numcodecs==2024.6.1",
"obstore>=0.5.1",
]
safetensors = [
"safetensors",
"ml-dtypes",
"obstore>=0.5.1",
]

# kerchunk-based readers
hdf5 = [
@@ -71,6 +76,7 @@ fits = [
]
all_readers = [
"virtualizarr[hdf]",
"virtualizarr[safetensors]",
"virtualizarr[hdf5]",
"virtualizarr[netcdf3]",
"virtualizarr[fits]",
@@ -176,7 +182,7 @@ rust = "*"
run-mypy = { cmd = "mypy virtualizarr" }
# Using '--dist loadscope' (rather than default of '--dist load' when '-n auto'
# is used), reduces test hangs that appear to be macOS-related.
run-tests = { cmd = "pytest -n auto --dist loadscope --run-network-tests --verbose --durations=10" }
run-tests-no-network = { cmd = "pytest -n auto --verbose" }
run-tests-cov = { cmd = "pytest -n auto --run-network-tests --verbose --cov=virtualizarr --cov=term-missing" }
run-tests-xml-cov = { cmd = "pytest -n auto --run-network-tests --verbose --cov=virtualizarr --cov-report=xml" }
@@ -186,12 +192,12 @@ run-tests-html-cov = { cmd = "pytest -n auto --run-network-tests --verbose --cov
[tool.pixi.environments]
min-deps = ["dev", "test", "hdf", "hdf5", "hdf5-lib"] # VirtualiZarr/conftest.py uses h5py, so the minimum set of dependencies for testing still includes hdf libs
# Inherit from min-deps to get all the test commands, along with optional dependencies
test = ["dev", "test", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore"]
test-py311 = ["dev", "test", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py311"] # test against python 3.11
test-py312 = ["dev", "test", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py312"] # test against python 3.12
minio = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py312", "minio"]
upstream = ["dev", "test", "hdf", "hdf5", "hdf5-lib", "netcdf3", "upstream", "icechunk-dev"]
all = ["dev", "test", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "all_readers", "all_writers"]
test = ["dev", "test", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore"]
test-py311 = ["dev", "test", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py311"] # test against python 3.11
test-py312 = ["dev", "test", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py312"] # test against python 3.12
minio = ["dev", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py312", "minio"]
upstream = ["dev", "test", "hdf", "safetensors", "hdf5", "hdf5-lib", "netcdf3", "upstream", "icechunk-dev"]
all = ["dev", "test", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "all_readers", "all_writers"]
docs = ["docs"]

# Define commands to run within the docs environment