
Update SimpleTokenizer for SAM3 tokenizer convenience #37

Merged
Laughing-q merged 6 commits into main from simple-tokenizer
Dec 15, 2025

Conversation

@Laughing-q (Member) commented Dec 15, 2025

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Adds a PyTorch-friendly __call__ API to the CLIP SimpleTokenizer for easy text→token tensor conversion 🚀

📊 Key Changes

  • Introduced from __future__ import annotations for cleaner type hints 🧩
  • Added a torch dependency in clip/simple_tokenizer.py to return tensors 🔥
  • Stored commonly used tokenizer constants on init:
    • sot_token_id (<|startoftext|>)
    • eot_token_id (<|endoftext|>)
    • default context_length = 77 🧠
  • Implemented SimpleTokenizer.__call__(texts, context_length=None) -> torch.LongTensor:
    • Accepts a single string or list of strings
    • Produces a padded LongTensor of shape [batch, context_length]
    • Truncates overlong inputs and ensures the last token is eot_token_id ✂️
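Based on the summary above, the new callable roughly behaves like this minimal sketch (not the actual ultralytics/CLIP implementation; `encode` stands in for the tokenizer's BPE encoder, and the token IDs shown are CLIP's usual special-token values):

```python
import torch

SOT_ID, EOT_ID = 49406, 49407  # CLIP's <|startoftext|> / <|endoftext|> IDs
CONTEXT_LENGTH = 77  # default context length described in the PR


def tokenize(texts, encode, context_length=CONTEXT_LENGTH):
    """Return a [batch, context_length] LongTensor; `encode` maps str -> list[int]."""
    if isinstance(texts, str):  # accept a single string or a list of strings
        texts = [texts]
    result = torch.zeros(len(texts), context_length, dtype=torch.long)
    for i, text in enumerate(texts):
        tokens = [SOT_ID] + encode(text) + [EOT_ID]
        if len(tokens) > context_length:  # truncate overlong inputs...
            tokens = tokens[:context_length]
            tokens[-1] = EOT_ID  # ...and ensure the last token is EOT
        result[i, : len(tokens)] = torch.tensor(tokens, dtype=torch.long)
    return result
```

Calling `tokenize(["a photo of a cat", "a dog"], encode)` yields a zero-padded `[2, 77]` LongTensor ready for a CLIP-style text encoder.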

🎯 Purpose & Impact

  • Makes tokenization easier to use in PyTorch pipelines (call tokenizer directly to get model-ready tensors) ✅
  • Standardizes CLIP-like behavior with a default context length of 77, reducing boilerplate 📏
  • Improves performance and ergonomics for batching (automatic padding + truncation) ⚡
  • Potential impact: introduces a hard dependency on PyTorch for this module; environments without torch may need to install it or avoid importing this tokenizer 📦

@UltralyticsAssistant added the dependencies and enhancement labels Dec 15, 2025
@UltralyticsAssistant (Member) commented

👋 Hello @Laughing-q, thank you for submitting an ultralytics/CLIP 🚀 PR! This is an automated message, and an engineer will assist soon. To ensure a seamless integration of your work, please review the following checklist:

  • Define a Purpose: Clearly explain the purpose of your fix or feature in your PR description, and link to any relevant issues. Ensure your commit messages are clear, concise, and adhere to the project's conventions.
  • Synchronize with Source: Confirm your PR is synchronized with the ultralytics/CLIP main branch. If it's behind, update it by clicking the 'Update branch' button or by running git pull and git merge main locally.
  • Ensure CI Checks Pass: Verify all Ultralytics Continuous Integration (CI) checks are passing. If any checks fail, please address the issues.
  • Update Documentation: Update the relevant documentation for any new or modified features.
  • Add Tests: If applicable, include or update tests to cover your changes, and confirm that all tests are passing.
  • Sign the CLA: Please ensure you have signed our Contributor License Agreement if this is your first Ultralytics PR by writing "I have read the CLA Document and I sign the CLA" in a new message.
  • Minimize Changes: Limit your changes to the minimum necessary for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀

@UltralyticsAssistant (Member) left a comment

🔍 PR Review

Made with ❤️ by Ultralytics Actions

Overall the change is small and the new callable API looks correct for padding/truncation with SOT/EOT wrapping. Main issues: avoid assert for runtime validation, fix the malformed docstring, and consider aligning per-row tensor creation dtype/device with the preallocated output tensor to prevent extra casts/copies and improve usability.
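The dtype/device point can be illustrated with a short sketch (illustrative names, not the PR's code): creating each per-row tensor with the output tensor's own dtype and device lets the row assignment proceed without an implicit cast or cross-device copy.

```python
import torch

# Preallocated output, as in the tokenizer: one row per text, zero-padded.
result = torch.zeros(2, 8, dtype=torch.long)

tokens = [49406, 5, 6, 49407]  # SOT, two BPE tokens, EOT
# Match the destination's dtype/device up front instead of relying on
# an implicit cast at assignment time.
row = torch.tensor(tokens, dtype=result.dtype, device=result.device)
result[0, : len(tokens)] = row
```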

💬 Posted 3 inline comments

@UltralyticsAssistant (Member) left a comment

🔍 PR Review 2

Made with ❤️ by Ultralytics Actions

Clean, focused change overall: caching SOT/EOT IDs and adding a callable API is useful. The main risk is relying on self.context_length without guaranteeing it exists, which can cause runtime AttributeError. Also, assert isn’t ideal for validating public inputs; an explicit exception is safer.

💬 Posted 3 inline comments

@UltralyticsAssistant (Member) left a comment

🔍 PR Review 3

Made with ❤️ by Ultralytics Actions

Overall clean, focused change: caching SOT/EOT IDs and adding a callable API is straightforward and should help downstream usage. The only issue worth addressing is the context_length validation: using or plus an assert can mask invalid inputs and may be skipped in optimized runs; switching to explicit None handling and raising ValueError would make this more robust.
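The suggested fix can be sketched as follows (hypothetical helper name): `context_length or default` silently maps falsy-but-invalid values like 0 to the default, and assert statements are stripped under python -O, so explicit None handling plus a ValueError is more robust.

```python
def resolve_context_length(context_length, default=77):
    """Resolve an optional context length with explicit validation."""
    if context_length is None:  # explicit None check, not `x or default`,
        context_length = default  # so an invalid 0 is rejected below, not masked
    if not isinstance(context_length, int) or context_length <= 0:
        raise ValueError(f"context_length must be a positive int, got {context_length!r}")
    return context_length
```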

💬 Posted 1 inline comment

@Laughing-q Laughing-q merged commit 643beff into main Dec 15, 2025
6 checks passed
@Laughing-q Laughing-q deleted the simple-tokenizer branch December 15, 2025 16:08
@UltralyticsAssistant (Member) commented

Merged — thank you for the awesome improvement, @Laughing-q (and thanks @fcakyon for the contributions)! 🎉

As Leonardo da Vinci famously said, “Simplicity is the ultimate sophistication.” This PR embodies that: adding a clean, PyTorch-friendly SimpleTokenizer.__call__ makes text → model-ready token tensors feel effortless, while the default context_length=77, padding, and safe truncation bring CLIP-like ergonomics with far less boilerplate.

Really appreciate the thoughtful design choices and the focus on developer experience—this will make batching and integration into PyTorch pipelines much smoother for everyone.
