Skip to content

Add safetensors reader #555

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

nenb
Copy link

@nenb nenb commented Apr 22, 2025

⚠️ DO NOT MERGE: ⚠️ This PR should not be merged until support for dtype extensions is added to zarr-python.

Summary

This PR is the first step in #367. It adds a reader for the official Safetensors specification (see specs). Note that HuggingFace currently implements a SafeTensor reader that goes considerably beyond the official specification. Support for this implementation (which in practice the entire ML/AI community relies on) will be added in the next step for #367.

Notes for reviewers

  • This PR was developed in association with an AI coding assistant. Make of that what you will!

  • I have added a notebook to show how everything works. This can be removed in the final version. I have a question at the end of the notebook related to partial get support, which I would appreciate a core dev looking into.

Checklist

  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@TomNicholas TomNicholas added enhancement New feature or request references generation Reading byte ranges from archival files readers labels Apr 22, 2025
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nenb this looks great.

Thank you for using the new ManifestStore abstraction too (despite us not having documented or released it yet 😅) - that will save us time refactoring later.

My comments are just nits, seems like we're waiting on zarr dtype extensions to support everything we want here?

except ImportError:
raise ImportError("The ml_dtypes package is required to read safetensors files. Please install it with pip install virtualizarr[safetensors].")

dtype_map = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just make this a global variable at the top of the file for visibility?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave it as is for now until the Zarr dtype PR is released, and then will chat with you about how best to implement this function.

@nenb nenb force-pushed the add-safetensors-reader branch from e472167 to b08fc84 Compare April 22, 2025 20:34
@nenb nenb temporarily deployed to test-release April 22, 2025 20:34 — with GitHub Actions Inactive
@nenb nenb force-pushed the add-safetensors-reader branch from b08fc84 to 6d37b17 Compare April 22, 2025 20:37
@nenb nenb temporarily deployed to test-release April 22, 2025 20:37 — with GitHub Actions Inactive
@nenb nenb marked this pull request as ready for review April 22, 2025 21:07
@nenb nenb temporarily deployed to test-release April 23, 2025 20:20 — with GitHub Actions Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request readers references generation Reading byte ranges from archival files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants