CLI to upload arbitrary huge folder #2254
Changes from 91 commits
@@ -138,3 +138,5 @@ dmypy.json

# Spell checker config
cspell.json

tmp*
@@ -103,6 +103,80 @@ set, files are uploaded at the root of the repo.

For more details about the CLI upload command, please refer to the [CLI guide](./cli#huggingface-cli-upload).

## Upload a large folder

In most cases, the [`upload_folder`] method and `huggingface-cli upload` command should be the go-to solutions to upload files to the Hub. They ensure a single commit will be made, handle a lot of use cases, and fail explicitly when something goes wrong. However, when dealing with a large amount of data, you will usually prefer a resilient process, even if it leads to more commits or requires more CPU usage. The [`upload_large_folder`] method has been implemented in that spirit:
- it is resumable: the upload process is split into many small tasks (hashing files, pre-uploading them, and committing them). Each time a task is completed, the result is cached locally in a `.cache/huggingface/` folder inside the folder you are trying to upload. Restarting the process after an interruption therefore resumes from where it left off, skipping tasks that already completed.
- it is multi-threaded: hashing large files and pre-uploading them benefits a lot from multithreading if your machine allows it.
- it is resilient to errors: a high-level retry mechanism retries each independent task indefinitely until it passes, no matter if it raises an `OSError`, `ConnectionError`, `PermissionError`, etc. This mechanism is double-edged: if transient errors happen, the process will continue and retry; if permanent errors happen (e.g. permission denied), it will retry indefinitely without solving the root cause. A minimal sketch of this kind of retry loop follows below.
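
To make the retry behavior concrete, here is a minimal, hypothetical sketch of an indefinite retry loop. It is not the actual implementation of [`upload_large_folder`]; the names `process_task` and `task` are made up for illustration.

```py
import logging
import time


def run_with_indefinite_retry(process_task, task, delay_s: float = 10.0):
    """Retry a single task until it eventually succeeds (illustrative only)."""
    while True:
        try:
            # e.g. hash a file, pre-upload it, or commit it
            return process_task(task)
        except Exception as error:  # OSError, ConnectionError, PermissionError, ...
            # Transient errors eventually pass; permanent ones (e.g. permission
            # denied) make this loop spin forever, which is the double-edged part.
            logging.warning("Task %r failed (%s); retrying in %ss", task, error, delay_s)
            time.sleep(delay_s)
```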

If you want more technical details about how `upload_large_folder` is implemented under the hood, please have a look at the [`upload_large_folder`] package reference.

Here is how to use [`upload_large_folder`] in a script. The method signature is very similar to [`upload_folder`]:

```py
>>> from huggingface_hub import HfApi
>>> api = HfApi()
>>> api.upload_large_folder(
...     repo_id="HuggingFaceM4/Docmatix",
...     repo_type="dataset",
...     folder_path="/path/to/local/docmatix",
... )
```

You will see the following output in your terminal:
```
Repo created: https://huggingface.co/datasets/HuggingFaceM4/Docmatix
Found 5 candidate files to upload
Recovering from metadata files: 100%|█████████████████████████████████████| 5/5 [00:00<00:00, 542.66it/s]

---------- 2024-07-22 17:23:17 (0:00:00) ----------
Files:   hashed 5/5 (5.0G/5.0G) | pre-uploaded: 0/5 (0.0/5.0G) | committed: 0/5 (0.0/5.0G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 5 | committing: 0 | waiting: 11
---------------------------------------------------
```

First, the repo is created if it didn't exist before. Then the local folder is scanned for files to upload and, for each file, we try to recover metadata information from a previously interrupted upload. From there, workers are launched and a status update is printed every minute. Here, we can see that the 5 files have already been hashed but not yet pre-uploaded: 5 workers are pre-uploading files while the 11 others are waiting for a task.

A command line tool is also provided. You can define the number of workers and the level of verbosity in the terminal:

```sh
huggingface-cli upload-large-folder HuggingFaceM4/Docmatix --repo-type=dataset /path/to/local/docmatix --num-workers=16
```

<Tip>

For large uploads, you have to set `repo_type="model"` or `--repo-type=model` explicitly. Usually, this information is implicit in all other `HfApi` methods. Requiring it here avoids uploading data to a repository with the wrong type, in which case you would have to re-upload everything.

</Tip>

<Tip warning={true}>

While it is much more robust for uploading large folders, `upload_large_folder` is more limited than [`upload_folder`] feature-wise. In practice:
- you cannot set a custom `path_in_repo`. If you want to upload to a subfolder, you need to set the proper structure locally.
- you cannot set a custom `commit_message` or `commit_description`, since multiple commits are created.
- you cannot delete from the repo while uploading. Please make a separate commit first.
- you cannot create a PR directly. Please create a PR first and then commit to it by passing `revision` (see the sketch after this tip).

> Why?

> Because we would have to add some logic to ensure the script is resumable. Since large uploads are meant for long-lasting processes, I don't think that manually creating PRs will be a blocker here. Worst case, the user can create a script to call

</Tip>
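
If you do need the files to land on a pull request, a possible workaround is to open the PR yourself and then target its git reference. The sketch below is an illustration only: the repo id and folder path are made up, and it assumes the returned discussion object exposes the PR's git reference (e.g. `refs/pr/1`) and that `upload_large_folder` accepts a `revision` argument as described above.

```py
from huggingface_hub import HfApi

api = HfApi()

# Open an empty pull request first...
pr = api.create_pull_request(
    repo_id="HuggingFaceM4/Docmatix",  # hypothetical target repo
    repo_type="dataset",
    title="Add Docmatix data",
)

# ...then upload the large folder directly to that PR's branch.
api.upload_large_folder(
    repo_id="HuggingFaceM4/Docmatix",
    repo_type="dataset",
    folder_path="/path/to/local/docmatix",
    revision=pr.git_reference,  # e.g. "refs/pr/1"
)
```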

### Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data, getting an upload/push to fail at the end of the process or encountering a degraded experience, be it on hf.co or when working locally, can be very annoying.

Check out our [Repository limitations and recommendations](https://huggingface.co/docs/hub/repositories-recommendations) guide for best practices on how to structure your repositories on the Hub. Next, let's move on with some practical tips to make your upload process as smooth as possible.

- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate on a script when failing takes only a little time.
- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always best to consider that something will fail at least once, no matter if it's due to your machine, your connection, or our servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you already uploaded before uploading the next batch. You are ensured that an LFS file that is already committed will never be re-uploaded, but checking it client-side can still save some time. This is what [`upload_large_folder`] does for you.
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up uploads on machines with very high bandwidth. To use `hf_transfer`:

  1. Specify the `hf_transfer` extra when installing `huggingface_hub` (e.g. `pip install huggingface_hub[hf_transfer]`).
  2. Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable (see the example below).

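As a quick illustration, here is a minimal sketch of enabling `hf_transfer` from Python. It assumes the environment variable is set before `huggingface_hub` is imported so that the flag is picked up; the file path and repo id are made up.

```py
import os

# Enable hf_transfer before importing huggingface_hub so the flag is picked up.
# Requires the extra: pip install "huggingface_hub[hf_transfer]"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="/path/to/local/model.safetensors",  # hypothetical file
    path_in_repo="model.safetensors",
    repo_id="username/my-model",  # hypothetical repo
)
```
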
<Tip warning={true}>

`hf_transfer` is a power user tool! It is tested and production-ready, but it lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).

</Tip>

## Advanced features

In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub.

@@ -418,36 +492,6 @@ you don't store another reference to it. This is expected as we don't want to keep in memory content that is
already uploaded. Finally, we create the commit by passing all the operations to [`create_commit`]. You can pass
additional operations (add, delete or copy) that have not been processed yet and they will be handled correctly.
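
As an illustration of that last point, here is a minimal sketch of a commit mixing an add and a delete operation; the repo id and file paths are made up:

```py
from huggingface_hub import CommitOperationAdd, CommitOperationDelete, create_commit

operations = [
    # A new file that has not been pre-uploaded yet: it will be handled on the fly.
    CommitOperationAdd(path_in_repo="data/new-shard.parquet", path_or_fileobj="/path/to/new-shard.parquet"),
    # A deletion bundled into the same commit.
    CommitOperationDelete(path_in_repo="data/obsolete-shard.parquet"),
]

create_commit(
    repo_id="username/my-dataset",  # hypothetical repo
    repo_type="dataset",
    operations=operations,
    commit_message="Replace obsolete shard",
)
```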

## (legacy) Upload files with Git LFS

All the methods described above use the Hub's API to upload files. This is the recommended way to upload files to the Hub.

> For a follow-up PR: I think we could support this. The PR title could be the commit message (and then just add "Part 1" or something like that).