Commit 56ea455

Update hf-xet version, update related xet docs (#3475)
1 parent 39ebbc0 commit 56ea455

5 files changed: +12 -12 lines changed

docs/source/en/guides/download.md

Lines changed: 2 additions & 2 deletions
@@ -246,7 +246,7 @@ Finally, you can also make a dry-run programmatically by passing `dry_run=True`
 Take advantage of faster downloads through `hf_xet`, the Python binding to the [`xet-core`](https://github.com/huggingface/xet-core) library that enables
 chunk-based deduplication for faster downloads and uploads. `hf_xet` integrates seamlessly with `huggingface_hub`, but uses the Rust `xet-core` library and Xet storage instead of LFS.

-`hf_xet` uses the Xet storage system, which breaks files down into immutable chunks, storing collections of these chunks (called blocks or xorbs) remotely and retrieving them to reassemble the file when requested. When downloading, after confirming the user is authorized to access the files, `hf_xet` will query the Xet content-addressable service (CAS) with the LFS SHA256 hash for this file to receive the reconstruction metadata (ranges within xorbs) to assemble these files, along with presigned URLs to download the xorbs directly. Then `hf_xet` will efficiently download the xorb ranges necessary and will write out the files on disk. `hf_xet` uses a local disk cache to only download chunks once, learn more in the [Chunk-based caching(Xet)](./manage-cache#chunk-based-caching-xet) section.
+`hf_xet` uses the Xet storage system, which breaks files down into immutable chunks, storing collections of these chunks (called blocks or xorbs) remotely and retrieving them to reassemble the file when requested. When downloading, after confirming the user is authorized to access the files, `hf_xet` will query the Xet content-addressable service (CAS) with the LFS SHA256 hash for this file to receive the reconstruction metadata (ranges within xorbs) to assemble these files, along with presigned URLs to download the xorbs directly. Then `hf_xet` will efficiently download the xorb ranges necessary and will write out the files on disk.

 To enable it, simply install the latest version of `huggingface_hub`:

@@ -256,6 +256,6 @@ pip install -U "huggingface_hub"

 As of `huggingface_hub` 0.32.0, this will also install `hf_xet`.

-All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/storage-backends).
+All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/xet/index).

 Note: `hf_transfer` was formerly used with the LFS storage backend and is now deprecated; use `hf_xet` instead.
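
Reviewer sketch (not part of the commit): the download guide says `hf_xet` is installed alongside `huggingface_hub` and used automatically when present, and the environment-variable docs in this commit say `HF_HUB_DISABLE_XET` turns it off. A quick way to check both from Python might look like the following. The helper names and the set of accepted "true" spellings are assumptions made for illustration; the diff only says the variable is "set to disable".

```python
import importlib.util
import os

# Assumption: spellings treated as "true". The diff does not list them.
_TRUTHY = {"1", "ON", "YES", "TRUE"}


def xet_disabled(env=os.environ) -> bool:
    """Return True if HF_HUB_DISABLE_XET looks like it is set to a true value."""
    return env.get("HF_HUB_DISABLE_XET", "").strip().upper() in _TRUTHY


def hf_xet_installed() -> bool:
    """Return True if the hf_xet module can be found in this environment."""
    return importlib.util.find_spec("hf_xet") is not None


def xet_expected(env=os.environ, installed=None) -> bool:
    """Xet is expected to be used when it is installed and not disabled."""
    if installed is None:
        installed = hf_xet_installed()
    return installed and not xet_disabled(env)


if __name__ == "__main__":
    print("hf_xet installed:", hf_xet_installed())
    print("Xet expected to be used:", xet_expected())
```

This only inspects the local environment; it does not contact the Hub or confirm that a given repository is Xet-enabled.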

docs/source/en/guides/manage-cache.md

Lines changed: 3 additions & 3 deletions
@@ -174,7 +174,7 @@ by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable to true.

 ## Chunk-based caching (Xet)

-To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating additional caching layer to enable chunk-based deduplication. This cache holds chunks (immutable byte ranges of files ~64KB in size) and shards (a data structure that maps files to chunks). For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/storage-backends).
+To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating additional caching layer to enable chunk-based deduplication. This cache holds chunks (immutable byte ranges of files ~64KB in size) and shards (a data structure that maps files to chunks). For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/xet/index).

 The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, utilized for uploads and downloads. It has the following structure:

@@ -201,7 +201,7 @@ Note that the `xet` caching system, like the rest of `hf_xet` is fully integrate

 ### `chunk_cache`

-This cache is used on the download path. The cache directory structure is based on a base-64 encoded hash from the content-addressed store (CAS) that backs each Xet-enabled repository. A CAS hash serves as the key to lookup the offsets of where the data is stored.
+This cache is used on the download path. The cache directory structure is based on a base-64 encoded hash from the content-addressed store (CAS) that backs each Xet-enabled repository. A CAS hash serves as the key to lookup the offsets of where the data is stored. Note: as of `hf_xet` 1.2.0 the chunk_cache is disabled by default. To enable it, set the `HF_XET_CHUNK_CACHE_SIZE_BYTES` environment variable to the appropriate size prior to launching the Python process.

 At the topmost level, the first two letters of the base 64 encoded CAS hash are used to create a subdirectory in the `chunk_cache` (keys that share these first two letters are grouped here). The inner levels are comprised of subdirectories with the full key as the directory name. At the base are the cache items which are ranges of blocks that contain the cached chunks.

@@ -295,7 +295,7 @@ Example full `xet`cache directory tree:
 │ │ │ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb
 ```

-To learn more about Xet Storage, see this [section](https://huggingface.co/docs/hub/storage-backends).
+To learn more about Xet Storage, see this [section](https://huggingface.co/docs/hub/xet/index).

 ## Caching assets
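
Reviewer sketch (not part of the commit): the `chunk_cache` section above describes a two-level layout, where the first two characters of the base-64 encoded CAS hash name a grouping directory and the full key names the directory beneath it. A minimal illustration of that path rule follows. The function name, the example key, and the `chunk_cache_dir` argument are hypothetical; this is not code from `hf_xet`, and where the `chunk_cache` directory itself lives is left to the caller.

```python
from pathlib import PurePosixPath


def chunk_cache_key_dir(chunk_cache_dir: PurePosixPath, key: str) -> PurePosixPath:
    """Return <chunk_cache_dir>/<first two chars of key>/<full key>.

    `key` stands in for the base-64 encoded CAS hash described in the docs.
    """
    if len(key) < 2:
        raise ValueError("key must have at least two characters")
    return chunk_cache_dir / key[:2] / key


if __name__ == "__main__":
    # Hypothetical cache location and key, for illustration only.
    root = PurePosixPath("/tmp/example/chunk_cache")
    print(chunk_cache_key_dir(root, "A1B2C3D4"))
```

Keys that share their first two characters end up under the same top-level directory, which matches the grouping described in the documentation text.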

docs/source/en/guides/upload.md

Lines changed: 1 addition & 1 deletion
@@ -176,7 +176,7 @@ pip install -U "huggingface_hub"

 As of `huggingface_hub` 0.32.0, this will also install `hf_xet`.

-All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/storage-backends).
+All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/xet/index).

 **Cluster / Distributed Filesystem Upload Considerations**

docs/source/en/package_reference/environment_variables.md

Lines changed: 5 additions & 5 deletions
@@ -89,13 +89,13 @@ Integer value to define the number of seconds to wait for server response when d

 ### HF_XET_CHUNK_CACHE_SIZE_BYTES

-To set the size of the Xet chunk cache locally. Increasing this will give more space for caching terms/chunks fetched from S3. A larger cache can better take advantage of deduplication across repos & files. If your network speed is much greater than your local disk speed (ex 10Gbps vs SSD or worse) then consider disabling the Xet cache for increased performance. To disable the Xet cache, set `HF_XET_CHUNK_CACHE_SIZE_BYTES=0`.
+To set the size of the Xet chunk cache locally. By default, the chunk cache is disabled. The chunk cache can be beneficial if you are generating new revisions to existing models or datasets as this is used to cache terms/chunks that are fetched from S3. A larger cache can better take advantage of deduplication across repos & files. To enable the chunk cache set the environment variable to a large number (10GB) or greater. However, in most cases when downloading or uploading new data, disabling the chunk cache will have better performance, which is why it is disabled by default.

-Defaults to `10000000000` (10GB).
+Defaults to `0` (0 bytes, means chunk cache is disabled).

 ### HF_XET_SHARD_CACHE_SIZE_LIMIT

-To set the size of the Xet shard cache locally. Increasing this will improve upload effeciency as chunks referenced in cached shard files are not re-uploaded. Note that the default soft limit is likely sufficient for most workloads.
+To set the size of the Xet shard cache locally. Increasing this will improve upload efficiency as chunks referenced in cached shard files are not re-uploaded. Note that the default soft limit is likely sufficient for most workloads.

 Defaults to `4000000000` (4GB).

@@ -169,7 +169,7 @@ You can set `HF_HUB_DISABLE_TELEMETRY=1` as environment variable to globally dis

 ### HF_HUB_DISABLE_XET

-Set to disable using `hf-xet`, even if it is available in your Python environment. This is since `hf-xet` will be used automatically if it is found, this allows explicitly disabling its usage.
+Set to disable using `hf-xet`, even if it is available in your Python environment. This is since `hf-xet` will be used automatically if it is found, this allows explicitly disabling its usage. If you are disabling Xet, please consider [filing an issue and including the diagnostics](https://github.com/huggingface/xet-core?tab=readme-ov-file#issues-diagnostics--debugging) information to help us understand why Xet is not working for you.

 ### HF_HUB_ENABLE_HF_TRANSFER

@@ -184,7 +184,7 @@ Set `hf-xet` to operate with increased settings to maximize network and disk res

 Consider this analogous to the legacy `HF_HUB_ENABLE_HF_TRANSFER=1` environment variable but applied to `hf-xet`.

-To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/storage-backends).
+To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/xet/index).

 ### HF_XET_RECONSTRUCT_WRITE_SEQUENTIALLY
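
Reviewer sketch (not part of the commit): the updated docs say the chunk cache is now off by default and that `HF_XET_CHUNK_CACHE_SIZE_BYTES` should be set before the Python process starts. One way to respect that from a launcher script is to build the child environment explicitly and start the real workload as a subprocess. The helper names below are hypothetical, and the 10 GB figure is simply the previous default quoted in this diff, not a recommendation.

```python
import os
import subprocess
import sys

TEN_GB = 10_000_000_000  # previous default shown in the diff


def env_with_chunk_cache(size_bytes: int, base=None) -> dict:
    """Copy an environment mapping and set the chunk cache size in the copy."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    env = dict(os.environ if base is None else base)
    env["HF_XET_CHUNK_CACHE_SIZE_BYTES"] = str(size_bytes)
    return env


def run_with_chunk_cache(args, size_bytes: int) -> subprocess.CompletedProcess:
    """Start a fresh Python process that sees the variable from its first instruction."""
    return subprocess.run(
        [sys.executable, *args],
        env=env_with_chunk_cache(size_bytes),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # The child only echoes the variable; a real script would do its downloads here.
    child = "import os; print(os.environ['HF_XET_CHUNK_CACHE_SIZE_BYTES'])"
    print(run_with_chunk_cache(["-c", child], TEN_GB).stdout.strip())
```

Setting the variable in the shell before running the script achieves the same thing; the subprocess form is only useful when the launcher itself is Python.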

setup.py

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ def get_version() -> str:
 install_requires = [
     "filelock",
     "fsspec>=2023.5.0",
-    "hf-xet>=1.1.3,<2.0.0; platform_machine=='x86_64' or platform_machine=='amd64' or platform_machine=='AMD64' or platform_machine=='arm64' or platform_machine=='aarch64'",
+    "hf-xet>=1.2.0,<2.0.0; platform_machine=='x86_64' or platform_machine=='amd64' or platform_machine=='AMD64' or platform_machine=='arm64' or platform_machine=='aarch64'",
     "httpx>=0.23.0, <1",
     "packaging>=20.9",
     "pyyaml>=5.1",

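Reviewer sketch (not part of the commit): the `hf-xet` requirement carries an environment marker, so it is only installed when `platform_machine` matches one of five values. The snippet below mirrors that condition with the standard library to show which machines would pull in `hf-xet`. The function name is hypothetical, and the check is a hand-written equivalent of the marker, not how pip evaluates it internally.

```python
import platform

# Machine names copied from the marker in setup.py above.
XET_MACHINES = {"x86_64", "amd64", "AMD64", "arm64", "aarch64"}


def hf_xet_marker_matches(machine=None) -> bool:
    """Return True if the given (or current) machine satisfies the marker."""
    if machine is None:
        # platform.machine() is the value behind the `platform_machine` marker.
        machine = platform.machine()
    return machine in XET_MACHINES


if __name__ == "__main__":
    print(platform.machine(), "->", hf_xet_marker_matches())
```

On machines outside this set, `huggingface_hub` installs without `hf-xet`, which is why the marker is part of the dependency line.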