Add safetensors reader #555

Open · wants to merge 5 commits into base: develop
3 changes: 3 additions & 0 deletions .gitignore
@@ -190,3 +190,6 @@ venv

# Pixi lock file (because it changes with every upstream commit)
pixi.lock

# AI-assisted code development: https://github.com/ezyang/codemcp
codemcp*
4 changes: 4 additions & 0 deletions docs/faq.md
@@ -92,6 +92,10 @@ You can also use this approach to write a reader that starts from a kerchunk-for

Currently if you want to call your new reader from `virtualizarr.open_virtual_dataset` you would need to open a PR to this repository, but we plan to generalize this system to allow 3rd party libraries to plug in via an entrypoint (see [issue #245](https://github.com/zarr-developers/VirtualiZarr/issues/245)).

### What ML/AI model formats are supported?

VirtualiZarr has built-in support for [SafeTensors](safetensors.md) files, which are commonly used for storing ML model weights in a safe, efficient format.

## How does this actually work?

I'm glad you asked! We can think of the problem of providing virtualized zarr-like access to a set of archival files in some other format as a series of steps:
3 changes: 2 additions & 1 deletion docs/index.md
@@ -15,7 +15,7 @@ VirtualiZarr aims to make the creation of cloud-optimized virtualized zarr data
## Features

* Create virtual references pointing to bytes inside an archival file with [`open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/latest/usage.html#opening-files-as-virtual-datasets),
- * Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4 and HDF5,
+ * Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4, HDF5, and [SafeTensors](safetensors.md),
* [Combine data from multiple files](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets) into one larger store using [xarray's combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html), such as [`xarray.concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html),
* Commit the virtual references to storage either using the [Kerchunk references](https://fsspec.github.io/kerchunk/spec.html) specification or the [Icechunk](https://icechunk.io/) transactional storage engine.
* Users access the virtual dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset).
@@ -79,6 +79,7 @@ self
installation
usage
examples
safetensors
faq
api
releases
134 changes: 134 additions & 0 deletions docs/safetensors.md
@@ -0,0 +1,134 @@
# SafeTensors Reader User Guide

The SafeTensors reader in VirtualiZarr lets you create virtual references to tensors stored in SafeTensors files, without loading or copying the underlying data. This guide explains how to use the reader effectively.

## What is SafeTensors Format?

SafeTensors is a file format developed by HuggingFace for storing tensors (multidimensional arrays)
that offers several advantages:
- Safe: No use of pickle, eliminating security concerns
- Efficient: Zero-copy access for fast loading
- Simple: Straightforward binary format with JSON header
- Language-agnostic: Available across Python, Rust, C++, and JavaScript

The format consists of three sections, which the sketch below parses:
- 8 bytes (header size): little-endian uint64 containing the size of the header
- JSON header: Contains metadata for all tensors (shapes, dtypes, offsets)
- Binary data: Contiguous tensor data
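
To make the byte layout concrete, here is a minimal sketch of parsing the header by hand (illustrative only; `read_safetensors_header` is not part of VirtualiZarr's or the safetensors library's API):

```python
import json
import struct

def read_safetensors_header(path):
    """Parse the JSON header of a SafeTensors file (illustrative sketch)."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian uint64 giving the header size
        (header_size,) = struct.unpack("<Q", f.read(8))
        # Next `header_size` bytes: UTF-8 JSON mapping tensor names to
        # {"dtype": ..., "shape": [...], "data_offsets": [begin, end]}
        return json.loads(f.read(header_size))

header = read_safetensors_header("model.safetensors")
for name, info in header.items():
    if name != "__metadata__":  # optional file-level metadata entry
        print(name, info["dtype"], info["shape"], info["data_offsets"])
```

Note that `data_offsets` are relative to the start of the binary data section (byte `8 + header_size` of the file), which is what makes serving tensors via simple byte-range requests possible.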

## How VirtualiZarr's SafeTensors Reader Works

VirtualiZarr's SafeTensors reader allows you to:
- Create "virtual" Zarr stores pointing to chunks of data inside SafeTensors files
- Open the virtual zarr stores as xarray DataArrays with named dimensions
- Access specific slices of tensors from cloud storage
- Preserve metadata from the original SafeTensors file
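
Conceptually, the reader turns each tensor into a chunk-manifest entry, i.e. a (path, offset, length) triple keyed by chunk position, pointing at the tensor's bytes in place. A hypothetical single-chunk entry might look like this (field names follow VirtualiZarr's chunk-manifest convention; all values are illustrative):

```python
# Illustrative values, as if taken from a parsed SafeTensors header
header_size, (begin, end) = 1_024, (0, 4_096)

# How one 2-D tensor stored as a single chunk could map to a manifest entry
manifest_entry = {
    "0.0": {  # chunk key: chunk (0, 0) of the array
        "path": "s3://my-bucket/model.safetensors",
        "offset": 8 + header_size + begin,  # absolute byte offset in the file
        "length": end - begin,              # tensor size in bytes
    }
}
```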

## Basic Usage

Opening a SafeTensors file is straightforward:

```python
import virtualizarr as vz

# Open a SafeTensors file
vds = vz.open_virtual_dataset("model.safetensors")

# Access tensors as xarray variables
weight = vds["weight"]
bias = vds["bias"]
```
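
The resulting virtual dataset behaves like any other VirtualiZarr dataset, so you can persist its references for later use, for example as Kerchunk JSON (a short sketch; the output filename is arbitrary):

```python
# Serialize the virtual references so the model can be reopened lazily later
vds.virtualize.to_kerchunk("model_refs.json", format="json")
```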

## Custom Dimension Names

By default, dimensions are named generically (e.g., "weight_dim_0", "weight_dim_1"). You can provide custom dimension names for better semantics:

```python
# Define custom dimension names
custom_dims = {
    "weight": ["input_dims", "output_dims"],
    "bias": ["output_dims"],
}

# Open with custom dimension names
vds = vz.open_virtual_dataset(
    "model.safetensors",
    virtual_backend_kwargs={"dimension_names": custom_dims},
)

# Now dimensions have meaningful names
print(vds["weight"].dims) # ('input_dims', 'output_dims')
print(vds["bias"].dims) # ('output_dims',)
```

## Loading Specific Variables

You can specify which variables to load as eager arrays instead of virtual references:

```python
# Load specific variables as eager arrays
vds = vz.open_virtual_dataset(
    "model_weights.safetensors",
    loadable_variables=["small_tensor1", "small_tensor2"],
)

# These will be loaded as regular numpy arrays
small_tensor1 = vds["small_tensor1"]
# Large tensors remain virtual references
large_tensor = vds["large_tensor"]
```
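
If you do not know in advance which tensors are small, one possible pattern is to choose `loadable_variables` by byte size from the header itself (an illustrative helper, not part of VirtualiZarr's API; it relies on the header layout described above):

```python
import json
import struct

def small_tensors(path, max_bytes=1_000_000):
    """Names of tensors at most `max_bytes` large (illustrative helper)."""
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size))
    return [
        name
        for name, info in header.items()
        if name != "__metadata__"
        and info["data_offsets"][1] - info["data_offsets"][0] <= max_bytes
    ]

vds = vz.open_virtual_dataset(
    "model_weights.safetensors",
    loadable_variables=small_tensors("model_weights.safetensors"),
)
```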

## Working with Remote Files

The SafeTensors reader supports reading from the HuggingFace Hub:
```python
# HuggingFace Hub
vds = vz.open_virtual_dataset(
    "https://huggingface.co/openai-community/gpt2/model.safetensors",
    virtual_backend_kwargs={"revision": "main"},
)
```

It also supports reading from object storage:

```python
# S3
vds = vz.open_virtual_dataset(
    "s3://my-bucket/model.safetensors",
    reader_options={
        "storage_options": {
            "key": "ACCESS_KEY",
            "secret": "SECRET_KEY",
            "region_name": "us-west-2",
        }
    },
)
```
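
For public buckets no explicit credentials are needed; fsspec-style storage options also allow anonymous access (assuming the bucket permits it):

```python
# Public bucket: anonymous access instead of explicit credentials
vds = vz.open_virtual_dataset(
    "s3://public-bucket/model.safetensors",
    reader_options={"storage_options": {"anon": True}},
)
```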

## Accessing Metadata

SafeTensors files can contain metadata at the file level and tensor level:

```python
# Access file-level metadata
print(vds.attrs) # File-level metadata

# Access tensor-specific metadata
print(vds["weight"].attrs) # Tensor-specific metadata

# Access original SafeTensors dtype information
original_dtype = vds["weight"].attrs["original_safetensors_dtype"]
print(f"Original dtype: {original_dtype}")
```

## Known Limitations

### Performance Considerations
- Very large tensors (>1GB) are treated as a single chunk, which may impact memory usage when accessing small slices
- Files with thousands of tiny tensors may have overhead due to metadata handling

## Best Practices

- **For large tensors**: Use slicing to access only the portions you need, as sketched below
- **For remote files**: Pass credentials via `reader_options={"storage_options": ...}` and read from infrastructure close to the data to reduce latency
- **For many small tensors**: Consider loading them eagerly using `loadable_variables`
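
As a sketch of the first point: persist the references, then open them lazily with xarray so selections are only evaluated on demand (this uses the standard kerchunk `reference://` pattern; filenames are illustrative). Because each tensor is stored as a single chunk, reading any slice still fetches that tensor's whole chunk:

```python
import xarray as xr

# Persist the virtual references, then reopen them lazily via kerchunk/zarr
vds.virtualize.to_kerchunk("model_refs.json", format="json")
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "model_refs.json"},
    },
)
row = ds["weight"][0, :]  # lazy; bytes are fetched only on .values / .compute()
```
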
22 changes: 22 additions & 0 deletions docs/usage.md
@@ -89,6 +89,28 @@ aws_credentials = {"key": ..., "secret": ...}
vds = open_virtual_dataset("s3://some-bucket/file.nc", reader_options={'storage_options': aws_credentials})
```

### Opening different file formats

VirtualiZarr automatically detects the file format based on the file extension or content. Currently supported formats include:

- **NetCDF/HDF5**: Scientific data formats (NetCDF3, NetCDF4/HDF5)
- **DMRPP**: OPeNDAP DMR++ sidecar metadata files describing the layout of HDF5/NetCDF4 data
- **FITS**: Astronomical data in Flexible Image Transport System format
- **TIFF**: Tagged Image File Format for geospatial and scientific imagery
- **SafeTensors**: ML model weights format (`*.safetensors`), see the [SafeTensors guide](safetensors.md) for details
- **Kerchunk references**: Previously created virtualized references

Each format has specific readers optimized for its structure. For SafeTensors files, additional options like custom dimension naming are available:

```python
# Open a SafeTensors file with custom dimension names
custom_dims = {"weight": ["input_features", "output_features"]}
vds = open_virtual_dataset(
    "model.safetensors",
    virtual_backend_kwargs={"dimension_names": custom_dims},
)
```

## Chunk Manifests

In the Zarr model N-dimensional arrays are stored as a series of compressed chunks, each labelled by a chunk key which indicates its position in the array. Whilst conventionally each of these Zarr chunks are a separate compressed binary file stored within a Zarr Store, there is no reason why these chunks could not actually already exist as part of another file (e.g. a netCDF file), and be loaded by reading a specific byte range from this pre-existing file.
Expand Down
20 changes: 13 additions & 7 deletions pyproject.toml
@@ -52,6 +52,11 @@ hdf = [
"imagecodecs-numcodecs==2024.6.1",
"obstore>=0.5.1",
]
safetensors = [
"safetensors",
"ml-dtypes",
"obstore>=0.5.1",
]

# kerchunk-based readers
hdf5 = [
@@ -71,6 +76,7 @@ fits = [
]
all_readers = [
"virtualizarr[hdf]",
"virtualizarr[safetensors]",
"virtualizarr[hdf5]",
"virtualizarr[netcdf3]",
"virtualizarr[fits]",
@@ -176,7 +182,7 @@ rust = "*"
run-mypy = { cmd = "mypy virtualizarr" }
# Using '--dist loadscope' (rather than default of '--dist load' when '-n auto'
# is used), reduces test hangs that appear to be macOS-related.
run-tests = { cmd = "pytest -n auto --dist loadscope --run-network-tests --verbose --durations=10" }
run-tests-no-network = { cmd = "pytest -n auto --verbose" }
run-tests-cov = { cmd = "pytest -n auto --run-network-tests --verbose --cov=virtualizarr --cov=term-missing" }
run-tests-xml-cov = { cmd = "pytest -n auto --run-network-tests --verbose --cov=virtualizarr --cov-report=xml" }
@@ -186,12 +192,12 @@ run-tests-html-cov = { cmd = "pytest -n auto --run-network-tests --verbose --cov
[tool.pixi.environments]
min-deps = ["dev", "test", "hdf", "hdf5", "hdf5-lib"] # VirtualiZarr/conftest.py uses h5py, so the minimum set of dependencies for testing still includes hdf libs
# Inherit from min-deps to get all the test commands, along with optional dependencies
test = ["dev", "test", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore"]
test-py311 = ["dev", "test", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py311"] # test against python 3.11
test-py312 = ["dev", "test", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py312"] # test against python 3.12
minio = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py312", "minio"]
upstream = ["dev", "test", "hdf", "hdf5", "hdf5-lib", "netcdf3", "upstream", "icechunk-dev"]
all = ["dev", "test", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "all_readers", "all_writers"]
test = ["dev", "test", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore"]
test-py311 = ["dev", "test", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py311"] # test against python 3.11
test-py312 = ["dev", "test", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py312"] # test against python 3.12
minio = ["dev", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "py312", "minio"]
upstream = ["dev", "test", "hdf", "safetensors", "hdf5", "hdf5-lib", "netcdf3", "upstream", "icechunk-dev"]
all = ["dev", "test", "remote", "hdf", "safetensors", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "all_readers", "all_writers"]
docs = ["docs"]

# Define commands to run within the docs environment