Add option to use hf tokenizer #147

Merged
merged 1 commit into from
Nov 4, 2024

Conversation

RissyRan
Collaborator

@RissyRan RissyRan commented Nov 4, 2024

Description

Add option to use hf tokenizer

Test

  • Provide an invalid HF tokenizer (a local file path rather than a Hub repo id); loading fails because the tokenizer cannot be resolved:
python JetStream/benchmarks/benchmark_serving.py   \
--tokenizer ~/maxtext/assets/tokenizer.mistral-v3  \
--save-result   \
--save-request-outputs   \
--request-outputs-file-path outputs-scanned-8x22b-hf-4.json   \
--num-prompts 4   \
--max-output-length 1024   \
--dataset openorca --run-eval true --use-hf-tokenizer=True


Using HuggingFace tokenizer: /home/ranran/maxtext/assets/tokenizer.mistral-v3
Traceback (most recent call last):
  File "/home/ranran/venv-maxtext/lib/python3.10/site-packages/transformers/utils/hub.py", line 403, in cached_file
    resolved_file = hf_hub_download(
  File "/home/ranran/venv-maxtext/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/ranran/venv-maxtext/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ranran/maxtext/assets/tokenizer.mistral-v3'. Use `repo_type` argument if needed.
  • Provide a valid HF tokenizer (a Hub repo id); the tokenizer loads and the benchmark starts:
python JetStream/benchmarks/benchmark_serving.py   \
--tokenizer mistralai/Mixtral-8x22B-Instruct-v0.1  \
--save-result   \
--save-request-outputs   \
--request-outputs-file-path outputs-scanned-8x22b-hf-4.json   \
--num-prompts 4   \
--max-output-length 1024   \
--dataset openorca --run-eval true --use-hf-tokenizer=True

Using HuggingFace tokenizer: mistralai/Mixtral-8x22B-Instruct-v0.1
len(sampled_indices)=4
In InputRequest, pass in actual output_length for each sample
The dataset contains 4 samples.
The filtered dataset contains 4 samples.
  0%|                                                                                                                               | 0/4 [00:00<?, ?it/s]
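
The two runs above differ only in the value passed to --tokenizer: a local file path in the failing run, a Hub repo id in the passing one. A minimal sketch of a pre-check that classifies the value before any tokenizer loading is attempted is shown below. The helper name classify_tokenizer_arg and the regular expression are hypothetical and not part of this PR; the pattern only approximates the repo id shape described in the error message ('repo_name' or 'namespace/repo_name') and is not the library's own validator.

```python
import os
import re

# Approximation (assumption) of a Hub repo id: an optional "namespace/" prefix
# followed by a name of up to 96 characters. Not the official validator.
_REPO_ID_RE = re.compile(r"^(?:[A-Za-z0-9][\w.\-]*/)?[A-Za-z0-9][\w.\-]{0,95}$")


def classify_tokenizer_arg(value: str) -> str:
  """Returns 'local_dir', 'local_file', 'hub_repo_id', or 'invalid'."""
  expanded = os.path.expanduser(value)
  if os.path.isdir(expanded):
    return "local_dir"
  if os.path.isfile(expanded):
    # A single file, such as a SentencePiece model, is neither a tokenizer
    # directory nor a repo id.
    return "local_file"
  if _REPO_ID_RE.match(value) and ".." not in value and "--" not in value:
    return "hub_repo_id"
  return "invalid"


if __name__ == "__main__":
  for arg in (
      "mistralai/Mixtral-8x22B-Instruct-v0.1",
      "/home/ranran/maxtext/assets/tokenizer.mistral-v3",
  ):
    print(arg, "->", classify_tokenizer_arg(arg))
```

With a check like this, the benchmark could warn early when --use-hf-tokenizer=True is combined with a value that looks like a file path, instead of surfacing the HFValidationError traceback.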

@RissyRan RissyRan changed the title from "Hardcode HF tokenizer" to "[DO NOT MERGE] Hardcode HF tokenizer" Nov 4, 2024
@RissyRan RissyRan changed the title from "[DO NOT MERGE] Hardcode HF tokenizer" to "Add option to use hf tokenizer" Nov 4, 2024
@RissyRan RissyRan force-pushed the 8x22b_debug branch 3 times, most recently from fde6a65 to d4702c3 Compare November 4, 2024 22:17
@RissyRan RissyRan marked this pull request as ready for review November 4, 2024 22:18
@RissyRan RissyRan requested a review from vipannalla as a code owner November 4, 2024 22:18
@RissyRan RissyRan force-pushed the 8x22b_debug branch 3 times, most recently from a55edbf to d3d6128 Compare November 4, 2024 22:36
Collaborator

@vipannalla vipannalla left a comment

LGTM, thanks

    return "test"
  elif use_hf_tokenizer:
    print(f"Using HuggingFace tokenizer: {tokenizer_name}")
    return AutoTokenizer.from_pretrained(tokenizer_name)
Collaborator

The user needs to accept the Mistral AI (or other model provider's) agreement on HuggingFace; otherwise this line will error out. Can you add a comment about it for future users?

Collaborator Author

Sure thing!

Collaborator Author

Added comment accordingly.
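
Beyond a code comment, the gated-model concern raised in this thread could also be surfaced at runtime. The sketch below is not what this PR merged: load_hf_tokenizer and its loader parameter are hypothetical names, and it assumes that a failed load of a gated or inaccessible repo surfaces as an OSError from AutoTokenizer.from_pretrained.

```python
from typing import Any, Callable, Optional


def load_hf_tokenizer(
    tokenizer_name: str,
    loader: Optional[Callable[[str], Any]] = None,
) -> Any:
  """Loads a HuggingFace tokenizer, adding a hint about gated models on failure."""
  if loader is None:
    # Imported lazily so this helper can be exercised without transformers
    # installed (for example with a fake loader in a unit test).
    from transformers import AutoTokenizer

    loader = AutoTokenizer.from_pretrained
  try:
    return loader(tokenizer_name)
  except OSError as err:
    # Assumption: gated or inaccessible repos are reported as OSError.
    raise RuntimeError(
        f"Could not load HuggingFace tokenizer '{tokenizer_name}'. "
        "If the model is gated, accept its agreement on the model's "
        "HuggingFace page and authenticate (for example with "
        "`huggingface-cli login`) before retrying."
    ) from err
```

The loader argument exists only so the error path can be tested without network access; callers would normally pass just the tokenizer name.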

@RissyRan RissyRan merged commit 52d63a5 into main Nov 4, 2024
3 checks passed
@RissyRan RissyRan deleted the 8x22b_debug branch November 4, 2024 23:01
2 participants