
[HF] Implement save_to_hf_hub for Model and Tokenizer #3982

Merged: 9 commits, Dec 6, 2022

Conversation

@sijunhe (Collaborator) commented on Dec 1, 2022

PR types

New features

PR changes

APIs

Description

  • implement save_to_hf_hub for Model and Tokenizer
from paddlenlp.transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ci-test-ernie-model", from_hf_hub=True)
model = AutoModel.from_pretrained("PaddlePaddle/ci-test-ernie-model", from_hf_hub=True)

tokenizer.save_to_hf_hub("ci-test-ernie-model")
model.save_to_hf_hub("ci-test-ernie-model")
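
The PR's actual implementation is not reproduced on this page; as a rough sketch, save_to_hf_hub can be thought of as "serialize to a scratch directory, then push that directory to the Hub". The helper below illustrates only that control flow; save_to_hub_sketch, save_fn, and upload_fn are hypothetical names, with save_fn standing in for something like save_pretrained and upload_fn for the upload step.

```python
import os
import tempfile

def save_to_hub_sketch(save_fn, upload_fn, repo_id):
    """Illustrative flow only: serialize locally, then push to the Hub.

    save_fn(dir) writes the artifacts (config, weights, vocab) into dir;
    upload_fn(dir, repo_id) pushes the directory's contents to the repo.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        save_fn(tmp_dir)                    # write model/tokenizer files locally
        return upload_fn(tmp_dir, repo_id)  # push everything that was written
```

The temporary directory is discarded afterwards, which is also what the review discussion below is about.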

@paddle-bot (bot) commented on Dec 1, 2022

Thanks for your contribution!

@sijunhe sijunhe requested a review from wj-Mcat December 1, 2022 11:14
@sijunhe sijunhe self-assigned this Dec 1, 2022
Review from @wj-Mcat (Contributor), Dec 1, 2022, on this snippet:

    with tempfile.TemporaryDirectory() as tmp_dir:
        # First create the repo (and clone its content if it's nonempty).

I have a few questions:

  1. If the repo already contains large files, won't the clone download be very slow?
  2. If the clone fails partway through because it takes too long, the temporary directory means everything has to start over from scratch each time. Isn't that a waste of time? Would it be better to let the caller specify tmp_dir?

Reply from @sijunhe (Collaborator, Author):

Good point. My new implementation drops the clone entirely and uses HTTP requests instead.


    if len(large_files) > 0:
        logger.info("Track files with git lfs: {}".format(", ".join(large_files)))
        repo.lfs_track(large_files)
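
How large_files is populated is not shown on this page. As an illustration only, a typical approach is to walk the save directory and collect files above a size threshold; the 100 MB value and the name find_large_files below are assumptions, not PaddleNLP's actual code.

```python
import os

# Illustrative threshold; not necessarily the value PaddleNLP uses.
LFS_THRESHOLD = 100 * 1024 * 1024  # 100 MB

def find_large_files(folder, threshold=LFS_THRESHOLD):
    """Return paths (relative to folder) of files larger than threshold bytes."""
    large = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            if os.path.getsize(path) > threshold:
                large.append(os.path.relpath(path, folder))
    return sorted(large)
```

Files returned by such a scan would then be handed to git-lfs tracking, as in the snippet above.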
Review from @wj-Mcat (Contributor):
git lfs requires its own package to be installed. Should we detect here whether git-lfs is available and, if not, emit a warning prompting the user to install it?

Reply from @sijunhe (Collaborator, Author):

Already replaced with HTTP requests.
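
The HTTP-based replacement is not shown on this page. As a hedged sketch of that approach: huggingface_hub exposes create_repo and upload_file, which push files over HTTP without any local git or git-lfs dependency. The helper names files_to_upload and push_over_http below are assumptions for illustration; the exact calls the PR settled on may differ.

```python
import os

def files_to_upload(local_dir):
    """Collect (path_in_repo, absolute_path) pairs for every file under local_dir."""
    out = []
    for root, _dirs, names in os.walk(local_dir):
        for name in names:
            abs_path = os.path.join(root, name)
            out.append((os.path.relpath(abs_path, local_dir), abs_path))
    return sorted(out)

def push_over_http(local_dir, repo_id, token=None):
    """Sketch: create the repo if needed, then upload each file over HTTP."""
    # Imported lazily so the pure file-walking helper above has no extra deps.
    from huggingface_hub import create_repo, upload_file

    create_repo(repo_id, token=token, exist_ok=True)  # no-op if it already exists
    for path_in_repo, abs_path in files_to_upload(local_dir):
        upload_file(
            path_or_fileobj=abs_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            token=token,
        )
```

Because every transfer is a plain HTTP request, there is no clone of existing repo contents and no git-lfs installation requirement, which addresses both review questions above.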

@osanseviero previously approved these changes on Dec 2, 2022:

Looks very nice!

Review from @Wauplin (Contributor):

Hi @sijunhe, the PR looks good to me! :)
I added 2 comments since you don't need to check for a token in the HfFolder (that's a bit too "low-level" for an external library).

(Resolved review threads on paddlenlp/transformers/model_utils.py and paddlenlp/transformers/tokenizer_utils_base.py)
Follow-up review from @Wauplin (Contributor):

(nit) HfFolder is not used anymore. Sorry, forgot about it in previous review :)

(Resolved review threads on paddlenlp/transformers/tokenizer_utils_base.py and paddlenlp/transformers/model_utils.py)
Review from @wj-Mcat (Contributor):

LGTM

@sijunhe sijunhe merged commit a7e760f into PaddlePaddle:develop Dec 6, 2022
@sijunhe sijunhe deleted the save_to_hf_hub branch December 6, 2022 04:38