-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
transformers-cli: LFS multipart uploads (> 5GB) #8663
Conversation
This reverts commit d7e32c4.
👍 nice job! |
src/transformers/commands/lfs.py
Outdated
def run(self): | ||
local_path = os.path.abspath(self.args.path) | ||
if not os.path.isdir(local_path): | ||
print("This does not look like a valid git repo") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe better to raise an error?
return data | ||
|
||
def __iter__(self): | ||
yield self.read(n=4 * 1024 * 1024) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 4 * 1024 * 1024?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
magic value from @madlag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -0,0 +1,219 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a copyright at the top?
``` [lfs "customtransfer.multipart"] | ||
|
||
path = /path/to/transformers/.env/bin/python | ||
|
||
args = -m debugpy --listen 5678 --wait-for-client /path/to/transformers/src/transformers/commands/transformers_cli.py | ||
lfs-multipart-upload ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is weirdly formatted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep – is there any way to tell our formatting tooling to not change lines which represent code or structured stuff?
src/transformers/commands/lfs.py
Outdated
|
||
|
||
class LfsCommands(BaseTransformersCLICommand): | ||
# Implementation of a custom transfer agent for the transfer type "multipart" for git-lfs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
src/transformers/hf_api.py
Outdated
assert filetype in self.ALLOWED_S3_FILE_TYPES, "Please specify filetype from {}".format( | ||
self.ALLOWED_S3_FILE_TYPES | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
src/transformers/hf_api.py
Outdated
assert filetype in self.ALLOWED_S3_FILE_TYPES, "Please specify filetype from {}".format( | ||
self.ALLOWED_S3_FILE_TYPES | ||
) | ||
path = "{}/api/{}/listObjs".format(self.endpoint, filetype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
src/transformers/hf_api.py
Outdated
assert filetype in self.ALLOWED_S3_FILE_TYPES, "Please specify filetype from {}".format( | ||
self.ALLOWED_S3_FILE_TYPES | ||
) | ||
path = "{}/api/{}/deleteObj".format(self.endpoint, filetype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
src/transformers/hf_api.py
Outdated
""" | ||
HuggingFace S3-based system, used for datasets and metrics. | ||
|
||
Get a presigned url, then upload file to S3. | ||
|
||
Outputs: url: Read-only url for the stored file on S3. | ||
""" | ||
urls = self.presign(token, filename=filename, organization=organization) | ||
assert filetype in self.ALLOWED_S3_FILE_TYPES, "Please specify filetype from {}".format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) capital letter for "ALLOWED_S3_FILE_TYPES" for class attribute is a bit weird for me. I'd prefer allowed_s3_file_types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
backported from code in datasets
so staying consistent
""" | ||
HuggingFace git-based system, used for models. | ||
|
||
Call HF API to create a whole repo. | ||
|
||
Params: | ||
lfsmultipartthresh: Optional: internal param for testing purposes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we change lfsmultipartthresh
to lfs_multi_part_thresh
? => very hard to read :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd rather keep the consistency with the server side on that one
@@ -73,29 +82,35 @@ def test_whoami(self): | |||
|
|||
def test_presign_invalid_org(self): | |||
with self.assertRaises(HTTPError): | |||
_ = self._api.presign(token=self._token, filename="nested/fake_org.txt", organization="fake") | |||
_ = self._api.presign( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) no need for _ =
=> just self._api.presign(...)
i good I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of nits. Thanks a lot for the PR! Also cc @agemagician
Thanks a lot for the PR, and thanks for letting me know. I will give my feedback after testing it with the 11B model soon. |
Hi, thanks a lot for the PR.
Then, after half an hour later, it raised an error:
Did I do something wrong? |
@mymusise might have been an intermittent server error. Can you try again? |
Yes, I try again and again. But it always raises this error at 👀 But, now
|
@mymusise Yes, looks like this is crashing/locking the server 😱 Do you mind trying again on Monday? As we'll have more bandwidth to fix then. Sorry about that :/ |
It's ok, guy, haha. Waiting for your good news. |
Hi, I try again today and push the big model file without any exception! 🎉 |
Hi, here I push another model file(4.9GB) again, but this time it gives me a 504 Gateway Time-out error 😓
Seems the big file is uploaded completely, it looks like there is some problem with the server configuration about the timeout. @julien-c Tried three times with the same result. |
@Pierrci - I ran some tests on ~10 GB files (
In case it is useful, here is a link to the I can always go back to the way of manually uploading to the git-lfs hash path. Maybe the error message is helpful though :-) |
Thanks @patrickvonplaten, I see where it might be coming from, gonna look at it now! |
I deployed a fix that should address your problem, can you try again @patrickvonplaten? @mymusise Is it possible for you to try again with |
Thanks, @Pierrci. Yes, I think my error is different from Patrick's one. Here is the information with |
Awesome it works now @Pierrci - thanks! :-) |
@mymusise Are you sure LFS is properly installed and configured for the repo? From your logs it seems your |
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
Implementation of a custom transfer agent for the transfer type "multipart" for git-lfs. This lets users upload large files >5GB 🔥.
Spec for LFS custom transfer agent is: https://github.com/git-lfs/git-lfs/blob/master/docs/custom-transfers.md
The PR introduces two commands to the CLI:
^ Do this once per model repo where you want to push >5GB files. It's documented in the error message you get if you just try to
git push
a 5GB file without having enabled it before.^ is the custom transfer agent itself. This is not meant to be called by the user, but by lfs directly.
Things to experiment with:
Experiment: