
Update SimpleTokenizer for SAM3 tokenizer convenience #37

Merged
Laughing-q merged 6 commits into main from simple-tokenizer
Dec 15, 2025

Conversation

@Laughing-q (Member) commented Dec 15, 2025

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Adds a PyTorch-friendly __call__ API to the CLIP SimpleTokenizer for easy text→token tensor conversion 🚀

📊 Key Changes

  • Introduced from __future__ import annotations for cleaner type hints 🧩
  • Added a torch dependency in clip/simple_tokenizer.py to return tensors 🔥
  • Stored commonly used tokenizer constants on init:
    • sot_token_id (<|startoftext|>)
    • eot_token_id (<|endoftext|>)
    • default context_length = 77 🧠
  • Implemented SimpleTokenizer.__call__(texts, context_length=None) -> torch.LongTensor:
    • Accepts a single string or list of strings
    • Produces a padded LongTensor of shape [batch, context_length]
    • Truncates overlong inputs and ensures the last token is eot_token_id ✂️
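Based on the summary above, the new callable roughly behaves like this minimal sketch (not the actual ultralytics/CLIP implementation; `encode` stands in for the tokenizer's BPE encoder, and the token IDs shown are CLIP's usual special-token values):

```python
import torch

SOT_ID, EOT_ID = 49406, 49407  # CLIP's <|startoftext|> / <|endoftext|> IDs
CONTEXT_LENGTH = 77  # default context length described in the PR


def tokenize(texts, encode, context_length=CONTEXT_LENGTH):
    """Return a [batch, context_length] LongTensor; `encode` maps str -> list[int]."""
    if isinstance(texts, str):  # accept a single string or a list of strings
        texts = [texts]
    result = torch.zeros(len(texts), context_length, dtype=torch.long)
    for i, text in enumerate(texts):
        tokens = [SOT_ID] + encode(text) + [EOT_ID]
        if len(tokens) > context_length:  # truncate overlong inputs...
            tokens = tokens[:context_length]
            tokens[-1] = EOT_ID  # ...and ensure the last token is EOT
        result[i, : len(tokens)] = torch.tensor(tokens, dtype=torch.long)
    return result
```

Calling `tokenize(["a photo of a cat", "a dog"], encode)` yields a zero-padded `[2, 77]` LongTensor ready for a CLIP-style text encoder.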

🎯 Purpose & Impact

  • Makes tokenization easier to use in PyTorch pipelines (call tokenizer directly to get model-ready tensors) ✅
  • Standardizes CLIP-like behavior with a default context length of 77, reducing boilerplate 📏
  • Improves performance and ergonomics for batching (automatic padding + truncation) ⚡
  • Potential impact: introduces a hard dependency on PyTorch for this module; environments without torch may need to install it or avoid importing this tokenizer 📦

@UltralyticsAssistant added the dependencies and enhancement labels Dec 15, 2025
@UltralyticsAssistant (Member) commented

👋 Hello @Laughing-q, thank you for submitting an ultralytics/CLIP 🚀 PR! This is an automated message, and an engineer will assist soon. To ensure a seamless integration of your work, please review the following checklist:

  • Define a Purpose: Clearly explain the purpose of your fix or feature in your PR description, and link to any relevant issues. Ensure your commit messages are clear, concise, and adhere to the project's conventions.
  • Synchronize with Source: Confirm your PR is synchronized with the ultralytics/CLIP main branch. If it's behind, update it by clicking the 'Update branch' button or by running git pull and git merge main locally.
  • Ensure CI Checks Pass: Verify all Ultralytics Continuous Integration (CI) checks are passing. If any checks fail, please address the issues.
  • Update Documentation: Update the relevant documentation for any new or modified features.
  • Add Tests: If applicable, include or update tests to cover your changes, and confirm that all tests are passing.
  • Sign the CLA: Please ensure you have signed our Contributor License Agreement if this is your first Ultralytics PR by writing "I have read the CLA Document and I sign the CLA" in a new message.
  • Minimize Changes: Limit your changes to the minimum necessary for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀

@UltralyticsAssistant (Member) left a comment

🔍 PR Review

Made with ❤️ by Ultralytics Actions

Overall the change is small and the new callable API looks correct for padding/truncation with SOT/EOT wrapping. Main issues: avoid assert for runtime validation, fix the malformed docstring, and consider aligning per-row tensor creation dtype/device with the preallocated output tensor to prevent extra casts/copies and improve usability.
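The dtype/device point can be illustrated with a short sketch (illustrative names, not the PR's code): creating each per-row tensor with the output tensor's own dtype and device lets the row assignment proceed without an implicit cast or cross-device copy.

```python
import torch

# Preallocated output, as in the tokenizer: one row per text, zero-padded.
result = torch.zeros(2, 8, dtype=torch.long)

tokens = [49406, 5, 6, 49407]  # SOT, two BPE tokens, EOT
# Match the destination's dtype/device up front instead of relying on
# an implicit cast at assignment time.
row = torch.tensor(tokens, dtype=result.dtype, device=result.device)
result[0, : len(tokens)] = row
```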

💬 Posted 3 inline comments

@UltralyticsAssistant (Member) left a comment

🔍 PR Review 2

Made with ❤️ by Ultralytics Actions

Clean, focused change overall: caching SOT/EOT IDs and adding a callable API is useful. The main risk is relying on self.context_length without guaranteeing it exists, which can cause runtime AttributeError. Also, assert isn’t ideal for validating public inputs; an explicit exception is safer.

💬 Posted 3 inline comments

@UltralyticsAssistant (Member) left a comment

🔍 PR Review 3

Made with ❤️ by Ultralytics Actions

Overall clean, focused change: caching SOT/EOT IDs and adding a callable API is straightforward and should help downstream usage. The only issue worth addressing is the context_length validation: using or plus an assert can mask invalid inputs and may be skipped in optimized runs; switching to explicit None handling and raising ValueError would make this more robust.
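The suggested fix can be sketched as follows (hypothetical helper name): `context_length or default` silently maps falsy-but-invalid values like 0 to the default, and assert statements are stripped under python -O, so explicit None handling plus a ValueError is more robust.

```python
def resolve_context_length(context_length, default=77):
    """Resolve an optional context length with explicit validation."""
    if context_length is None:  # explicit None check, not `x or default`,
        context_length = default  # so an invalid 0 is rejected below, not masked
    if not isinstance(context_length, int) or context_length <= 0:
        raise ValueError(f"context_length must be a positive int, got {context_length!r}")
    return context_length
```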

💬 Posted 1 inline comment

@Laughing-q Laughing-q merged commit 643beff into main Dec 15, 2025
6 checks passed
@Laughing-q Laughing-q deleted the simple-tokenizer branch December 15, 2025 16:08
@UltralyticsAssistant (Member) commented

Merged — thank you for the awesome improvement, @Laughing-q (and thanks @fcakyon for the contributions)! 🎉

As Leonardo da Vinci famously said, “Simplicity is the ultimate sophistication.” This PR embodies that: adding a clean, PyTorch-friendly SimpleTokenizer.__call__ makes text → model-ready token tensors feel effortless, while the default context_length=77, padding, and safe truncation bring CLIP-like ergonomics with far less boilerplate.

Really appreciate the thoughtful design choices and the focus on developer experience—this will make batching and integration into PyTorch pipelines much smoother for everyone.
