
Support reading tiktoken tokenizer.model file #31656

Merged: 62 commits into main from tiktoken_file_support, Sep 6, 2024

Conversation

@itazap (Collaborator) commented on Jun 27, 2024

Uses the existing TikTokenConverter to convert a tiktoken tokenizer.model file.
Depends on loading without a config.json file (#32356).

  • add case to convert_tiktoken_tokenizer
  • add internal model
  • add test
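
For context, a minimal usage sketch of what this enables (the repo id and subfolder below are hypothetical placeholders, not taken from this PR): a Hub repository whose tokenizer.model is in tiktoken format can be loaded directly with PreTrainedTokenizerFast, which converts it on the fly.

```python
# Hypothetical usage sketch: the repo id and subfolder are placeholders; any repo whose
# tokenizer.model is in tiktoken format (not SentencePiece) should follow the same pattern.
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", subfolder="original"
)
print(tokenizer.encode("Hello, tiktoken!"))
```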

Workflow changes

  1. tokenization_utils_base.py: when loading a model, the slow tokenizer is loaded first. If the tokenizer.model file is not SPM, a google.protobuf.message.DecodeError is raised, or a RuntimeError when loading the ModelProto. The first step is therefore to catch these SPM-related errors and set tokenizer=False to indicate failure.
  2. tokenization_utils_fast.py: check whether slow_tokenizer is False; if so, try to convert from tiktoken.
  3. convert_slow_tokenizer.py: use TikTokenConverter to perform the conversion (see the sketch after this list).
  • Note: we catch errors because there is no way, under the current Hub file conventions, to tell whether a tokenizer.model file is SPM or tiktoken. So we always try to convert from SPM first and fall back to tiktoken if that fails.
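
A simplified sketch of this fallback flow (illustrative only, not the exact transformers internals; the SentencePiece load stands in for the slow-tokenizer path, and the TikTokenConverter arguments are assumed from convert_slow_tokenizer.py):

```python
# Illustrative sketch of the SPM-first, tiktoken-fallback flow described above.
# The helper below is NOT the actual transformers code path; the TikTokenConverter
# constructor arguments are assumptions based on convert_slow_tokenizer.py.
import sentencepiece as spm
from google.protobuf.message import DecodeError

from transformers.convert_slow_tokenizer import TikTokenConverter


def load_tokenizer_model(vocab_file: str):
    try:
        # Step 1: assume tokenizer.model is a SentencePiece model and try to load it.
        sp = spm.SentencePieceProcessor()
        sp.Load(vocab_file)  # raises RuntimeError if the file is not a valid ModelProto
        return sp
    except (DecodeError, RuntimeError):
        # Steps 2-3: the file is not SPM, so fall back to the tiktoken converter,
        # which builds a fast `tokenizers.Tokenizer` from the tiktoken vocabulary file.
        return TikTokenConverter(vocab_file=vocab_file).converted()
```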

@ArthurZucker

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap force-pushed the tiktoken_file_support branch 2 times, most recently from cb46268 to 633cf73, July 1, 2024 09:07
@itazap itazap marked this pull request as ready for review July 11, 2024 12:06
@huggingface huggingface locked and limited conversation to collaborators Jul 16, 2024
@huggingface huggingface unlocked this conversation Jul 16, 2024
@ArthurZucker (Collaborator) left a comment

Very nice! A nit about special tokens mostly

Review threads (resolved): src/transformers/convert_slow_tokenizer.py, src/transformers/tokenization_utils_fast.py, tests/models/llama/test_tokenization_llama.py, src/transformers/tokenization_utils_base.py
@itazap itazap requested a review from ArthurZucker July 22, 2024 17:11
@ArthurZucker (Collaborator) left a comment

a few nits but I like that it's easy to load a tiktoken based model with PreTrainedTokenizerFast! Good work

Review threads (resolved): src/transformers/convert_slow_tokenizer.py, src/transformers/testing_utils.py, src/transformers/tokenization_utils_base.py, src/transformers/tokenization_utils_fast.py
@itazap itazap requested a review from ArthurZucker July 23, 2024 14:23
@ArthurZucker (Collaborator) left a comment

Great addition, we have one "last" decision to make and good to go!

Review threads (resolved): src/transformers/convert_slow_tokenizer.py, src/transformers/utils/import_utils.py, tests/models/llama/test_tokenization_llama.py
@itazap itazap requested a review from ArthurZucker July 31, 2024 09:47
@ArthurZucker (Collaborator) left a comment

LGTM I'll let you handle the tests(CI) and pushing the new docker!

Review threads (resolved): docs/source/en/tiktoken.md, src/transformers/tokenization_utils_base.py
@itazap itazap force-pushed the tiktoken_file_support branch 2 times, most recently from 00f6995 to 49de881, August 7, 2024 10:03
@ArthurZucker (Collaborator) left a comment

Thanks 🤗 a few nits but LGTM otherwise

Review threads (resolved): setup.py, src/transformers/convert_slow_tokenizer.py
@ArthurZucker (Collaborator) left a comment

Thanks LGTM! 🤗

Review thread (resolved): docker/consistency.dockerfile
@itazap itazap merged commit e48e5f1 into main Sep 6, 2024
26 checks passed
@itazap itazap deleted the tiktoken_file_support branch September 6, 2024 12:24
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Sep 6, 2024
* use existing TikTokenConverter to read tiktoken tokenizer.model file

* del test file

* create tiktoken integration file

* adding tiktoken llama test

* ALTERNATIVE IMPLEMENTATION: supports llama 405B

* fix one char

* remove redundant line

* small fix

* rm unused import

* flag for converting from tiktoken

* remove unneeded file

* ruff

* remove llamatiktokenconverter, stick to general converter

* tiktoken support v2

* update test

* remove stale changes

* update doc

* protect import

* use is_protobuf_available

* add templateprocessor in tiktokenconverter

* reverting templateprocessor from tiktoken support

* update test

* add require_tiktoken

* dev-ci

* trigger build

* trigger build again

* dev-ci

* [build-ci-image] tiktoken

* dev-ci

* dev-ci

* dev-ci

* dev-ci

* change tiktoken file name

* feedback review

* feedback rev

* applying feedback, removing tiktoken converters

* conform test

* adding docs for review

* add doc file for review

* add doc file for review

* add doc file for review

* support loading model without config.json file

* Revert "support loading model without config.json file"

This reverts commit 2753602.

* remove dev var

* updating docs

* safely import protobuf

* fix protobuf import error

* fix protobuf import error

* trying isort to fix ruff error

* fix ruff error

* try to fix ruff again

* try to fix ruff again

* try to fix ruff again

* doc table of contents

* add fix for consistency.dockerfile torchaudio

* ruff

* applying feedback

* minor typo

* merging with push-ci-image

* clean up imports

* revert dockerfile consistency
@pcuenca pcuenca mentioned this pull request Sep 17, 2024
itazap added a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024