
CLI to upload arbitrary huge folder #2254

Merged
merged 97 commits into main from large-upload-cli on Aug 29, 2024

Conversation

@Wauplin (Contributor) commented Apr 26, 2024

What for?

Upload arbitrarily large folders in a single command line!

How to use it?

Install

pip install git+https://github.com/huggingface/huggingface_hub

EDIT: the PR has been merged, so installation can now be done from the main branch.

Upload folder

huggingface-cli upload-large-folder <repo-id> <local-path> --repo-type=dataset

Every minute, a report with the current status is printed to the terminal. In addition, progress bars and errors are still displayed.

---------- 2024-04-26 16:24:25 (0:00:00) ----------
Files:   hashed 104/104 (22.5G/22.5G) | pre-uploaded: 0/42 (0.0/22.5G) | committed: 58/104 (24.9M/22.5G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 6 | committing: 0 | waiting: 0
---------------------------------------------------

Run huggingface-cli upload-large-folder --help to see all options.

PR documentation:

What does it do?

This CLI is intended to upload arbitrarily large folders in a single command:

  • the process is split in 4 steps: hash, get upload mode, LFS upload, commit
  • retries on error at each step
  • multi-threaded: workers are managed with queues (see the sketch below)
  • resumable: if the process is interrupted, you can re-run it. Only partially uploaded files are lost.
  • files are hashed only once
  • starts to upload files while other files are still being hashed
  • commits at most 50 files at a time
  • prevents concurrent commits
  • prevents rate limits as much as possible
  • prevents small commits

A .cache/huggingface/ folder will be created at the root of your folder to keep track of progress. Please do not modify these files manually. If you feel this folder got corrupted, please report it here, delete the .cache/huggingface/ folder entirely, and then restart your command. Some intermediate steps will be lost but the upload process should be able to continue correctly.
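
As a rough illustration of the worker/queue pattern mentioned above, here is a minimal sketch in Python; it is not the actual huggingface_hub implementation, and all names are hypothetical:

import queue
import threading

# One queue per step; workers pull from one queue and feed the next,
# so uploading can start while other files are still being hashed.
hash_queue = queue.Queue()
upload_queue = queue.Queue()

def hash_worker():
    while True:
        path = hash_queue.get()
        if path is None:  # sentinel: no more files to hash
            break
        # ... compute the file hash here, with retry on error ...
        upload_queue.put(path)  # hand off to the pre-upload step

workers = [threading.Thread(target=hash_worker, daemon=True) for _ in range(4)]
for w in workers:
    w.start()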

Known limitations

  • cannot set a path_in_repo => files are always uploaded at the root of the repo. If you want to upload to a subfolder, you need to set the proper structure locally.
  • cannot delete files on the repo while uploading a folder
  • cannot set a commit message/commit description
  • cannot create a PR by itself => you must first create a PR manually, then provide the revision

These limitations are documented.

@Wauplin (Contributor, Author) commented Aug 28, 2024

@FurkanGozukara you can use huggingface-cli upload-large-folder --help to learn how to use the CLI.
To pass a repo type, you must add --repo-type=dataset for instance.

Comment on lines 66 to 68
raise ValueError(
    "For large uploads, `repo_type` is explicitly required. Please set it to `model`, `dataset` or `space`."
)
A Member commented:
Feedback while using it @Wauplin: the error says:

ValueError: For large uploads, `repo_type` is explicitly required. Please set it to `model`, `dataset` or `space`.

but the expected argument is --repo-type, as otherwise you get:

huggingface-cli: error: unrecognized arguments: --repo_type=model

@Wauplin (Contributor, Author) replied:

addressed in cdfb27f. Thanks for the feedback!

@FurkanGozukara commented:

It prints too many messages too frequently while uploading; it even crashed my notebook.

Can we limit it to display the status on the same line, or make it less frequent and more compact?

It prints like 100 messages like this every second:

[screenshot of repeated log messages]

@FurkanGozukara commented:

It says the upload completed, printed three times, but there are no files in the repo.

You are about to upload a large folder to the Hub using `huggingface-cli upload-large-folder`. This is a new feature so feedback is very welcome!

A few things to keep in mind:
  - Repository limits still apply: https://huggingface.co/docs/hub/repositories-recommendations
  - Do not start several processes in parallel.
  - You can interrupt and resume the process at any time. The script will pick up where it left off except for partially uploaded files that would have to be entirely reuploaded.
  - Do not upload the same folder to several repositories. If you need to do so, you must delete the `./.cache/huggingface/` folder first.

Some temporary metadata will be stored under `/home/Ubuntu/apps/StableSwarmUI/Models/Lora/.cache/huggingface`.
  - You must not modify those files manually.
  - You must not delete the `./.cache/huggingface/` folder while a process is running.
  - You can delete the `./.cache/huggingface/` folder to reinitialize the upload state when process is not running. Files will have to be hashed and preuploaded again, except for already committed files.

For more details, run `huggingface-cli upload-large-folder --help` or check the documentation at https://huggingface.co/docs/huggingface_hub/guides/upload#upload-a-large-folder.
Repo created: https://huggingface.co/datasets/MonsterMMORPG/FLUX_Kohya_SS_Massive_Research_Part5
Found 181 candidate files to upload
Recovering from metadata files: 100%|████████| 181/181 [00:00<00:00, 564.82it/s]
All files have been processed! Exiting worker.
(the line above was repeated ~60 times)



---------- 2024-08-28 17:53:59 (0:00:00) ----------
Files:   hashed 181/181 (371.5G/371.5G) | pre-uploaded: 161/161 (371.5G/371.5G) | committed: 181/181 (371.5G/371.5G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------
(the same report was printed twice more at 17:54:00)
.
.
UPLOAD COMPLETED

@Wauplin (Contributor, Author) commented Aug 29, 2024

Can we limit it to display the status on the same line, or make it less frequent and more compact?

@FurkanGozukara yes, you can do that by passing --no-bars and --no-reports on the command line. I have added a command to show it more prominently.
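
For example, the full command becomes (same placeholders as in the PR description):

huggingface-cli upload-large-folder <repo-id> <local-path> --repo-type=dataset --no-bars --no-reports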

@Wauplin (Contributor, Author) commented Aug 29, 2024

It says the upload completed, printed three times, but there are no files in the repo.

Have you tried to reupload the same folder to multiple locations? If yes, only the first upload will be correct. As mentioned in the little help section:

  • You can delete the ./.cache/huggingface/ folder to reinitialize the upload state when process is not running. Files will have to be hashed and preuploaded again, except for already committed files.

I suspect that your local metadata says the files are already uploaded. You can delete it and rerun the command.
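
Concretely, with no upload process running, deleting the metadata from the root of the local folder looks like this (shell syntax; adjust the path to your setup):

rm -rf ./.cache/huggingface/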

@Wauplin (Contributor, Author) commented Aug 29, 2024

Thanks to everyone involved in this PR! The feedback from everyone has been immensely valuable in shaping this feature. I hope it'll now benefit as many users as possible! 🫶

Time to merge!

@Wauplin Wauplin merged commit ecbbeb3 into main Aug 29, 2024
19 checks passed
@Wauplin Wauplin deleted the large-upload-cli branch August 29, 2024 13:54
@FurkanGozukara commented:

It says the upload completed, printed three times, but there are no files in the repo.

Have you tried to reupload the same folder to multiple locations? If yes, only the first upload will be correct. As mentioned in the little help section:

  • You can delete the ./.cache/huggingface/ folder to reinitialize the upload state when process is not running. Files will have to be hashed and preuploaded again, except for already committed files.

I suspect that your local metadata says the files are already uploaded. You can delete it and rerun the command.

It appeared later, for some reason.

@FurkanGozukara commented:

@Wauplin can we set a subfolder path right now?

I tried it like this for a subfolder and it failed:


!huggingface-cli upload-large-folder "MonsterMMORPG/3D-Cartoon-Style-FLUX" r"C:\flux training\upload" --repo-type=model --no-bars

print(".\n.\nUPLOAD COMPLETED")

[screenshots of the error]

@Wauplin (Contributor, Author) commented Sep 3, 2024

Hi @FurkanGozukara, no this is currently not possible. See known limitations in the PR description:

cannot set a path_in_repo => files are always uploaded at the root of the repo. If you want to upload to a subfolder, you need to set the proper structure locally.
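
As an illustration (a sketch, not from the PR; the paths and subfolder name are hypothetical), you can stage the desired layout locally and upload the parent folder:

import shutil

# Hypothetical paths: to make files land under `subfolder/` in the repo,
# recreate that structure locally, then upload the parent folder.
shutil.copytree(r"C:\flux training\upload", r"C:\flux training\staged\subfolder")

# Then upload the parent folder; its contents map to the root of the repo:
#   huggingface-cli upload-large-folder <repo-id> "C:\flux training\staged" --repo-type=model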

@FurkanGozukara commented:

Hi @FurkanGozukara, no this is currently not possible. See known limitations in the PR description:

cannot set a path_in_repo => files are always uploaded at the root of the repo. If you want to upload to a subfolder, you need to set the proper structure locally.

I saw it, thank you. How do I set the proper structure locally? I tried it like in the screenshot above and it failed :/

@Wauplin (Contributor, Author) commented Sep 3, 2024

I don't understand the error, to be honest. The message says the provided path must be a folder, so I guess it's a problem with the input parameters. Could you try it from a Python script instead of calling it from the CLI? That would help narrow down the problem.

from huggingface_hub import HfApi

api = HfApi()
api.upload_large_folder(repo_id=..., repo_type=..., ...)

More info in https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.upload_large_folder
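
For reference, a filled-in version of that call might look like this (the repo id and folder path are placeholders; see the linked docs for the full signature):

from huggingface_hub import HfApi

api = HfApi()  # assumes you are already logged in via `huggingface-cli login`
api.upload_large_folder(
    repo_id="username/my-model",             # placeholder repo id
    folder_path=r"C:\flux training\upload",  # local folder to upload
    repo_type="model",
)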

@FurkanGozukara commented:

from huggingface_hub import HfApi

api = HfApi()
api.upload_large_folder(repo_id=..., repo_type=..., ...)

It both gave an error and worked, weirdly.

I see the files uploaded:

[screenshot]

@Wauplin (Contributor, Author) commented Sep 3, 2024

This error happens only if the token is not valid. It doesn't seem related to the tool itself.
In any case, glad to know the files have been uploaded :)

@FurkanGozukara commented:

This error happens only if the token is not valid. It doesn't seem related to the tool itself. In any case, glad to know the files have been uploaded :)

Thank you so much, this new upload is amazing.

@FurkanGozukara commented:

@Wauplin the new upload works amazingly,

but I get this warning/error lots of times when uploading around 70 GB.

Is this expected? The files are uploaded to the repo successfully.

[screenshot]

@Wauplin (Contributor, Author) commented Sep 4, 2024

No, this is not expected, but I suppose it has to do with how Jupyter notebooks handle logs. Nothing much to worry about.

@FurkanGozukara commented:

@Wauplin I have been rate-limited for the first time.

Could this be related to the new method?

None of the models uploaded fully; I trust the resume capability at the moment :D

[screenshots]

@FurkanGozukara commented:

@Wauplin I waited several hours,

restarted the process, and it is definitely hitting the API limit when verifying which files were correctly uploaded.

I don't know if this can be solved or not. I have lots of small files.

Just letting you know.

[screenshot]

@Wauplin (Contributor, Author) commented Sep 5, 2024

Hi @FurkanGozukara, sorry for the inconvenience. How many files are we talking about, and what size is each of them? And which file extensions? Also, are the files uploaded as regular or LFS files? This info would help identify use cases that are not handled perfectly.

@FurkanGozukara commented Sep 5, 2024

Hi @FurkanGozukara, sorry for the inconvenience. How many files are we talking about, and what size is each of them? And which file extensions? Also, are the files uploaded as regular or LFS files? This info would help identify use cases that are not handled perfectly.

10,581 files.

Around 44 files are big, like 6-7 GB; the rest are images, around 1-2 MB.

Let me give you exact numbers via a Python scan, one minute.

@FurkanGozukara commented:

@Wauplin here is the full list:

Extension: (none)
  Total files: 1
  Average size: 4.00 KB
  Min size: 4.00 KB
  Max size: 4.00 KB

Extension: .metadata
  Total files: 10581
  Average size: 4.00 KB
  Min size: 4.00 KB
  Max size: 4.00 KB

Extension: .safetensors
  Total files: 44
  Average size: 6.35 GB
  Min size: 3.97 GB
  Max size: 6.46 GB

Extension: .json
  Total files: 3
  Average size: 4.00 KB
  Min size: 4.00 KB
  Max size: 4.00 KB

Extension: .toml
  Total files: 6
  Average size: 4.00 KB
  Min size: 4.00 KB
  Max size: 4.00 KB

Extension: .npz
  Total files: 5240
  Average size: 260.00 KB
  Min size: 260.00 KB
  Max size: 260.00 KB

Extension: .png
  Total files: 44
  Average size: 1.48 MB
  Min size: 1.29 MB
  Max size: 1.64 MB

Extension: .txt
  Total files: 42
  Average size: 4.00 KB
  Min size: 4.00 KB
  Max size: 4.00 KB

Extension: .jpg
  Total files: 5200
  Average size: 640.86 KB
  Min size: 220.00 KB
  Max size: 1.27 MB
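
For reference, a per-extension scan like the one above can be produced with a short script (a sketch; the folder path is a placeholder):

from collections import defaultdict
from pathlib import Path

sizes_by_ext = defaultdict(list)
for p in Path(r"C:\flux training\upload").rglob("*"):  # placeholder path
    if p.is_file():
        sizes_by_ext[p.suffix].append(p.stat().st_size)

for ext, sizes in sorted(sizes_by_ext.items()):
    print(f"Extension: {ext or '(none)'}")
    print(f"  Total files: {len(sizes)}")
    print(f"  Average size: {sum(sizes) / len(sizes) / 1024:.2f} KB")
    print(f"  Min size: {min(sizes) / 1024:.2f} KB")
    print(f"  Max size: {max(sizes) / 1024:.2f} KB")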

@Wauplin (Contributor, Author) commented Sep 6, 2024

Thanks for the details! I don't have the bandwidth to look into it now, but it will definitely prove useful at some point. Can I ask you to open a new issue dedicated to it, describing the rate limiting you hit with this repo structure? Thanks in advance!
