Update SimpleTokenizer for SAM3 tokenizer convenience #37
Conversation
👋 Hello @Laughing-q, thank you for submitting a PR! For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀
UltralyticsAssistant left a comment
🔍 PR Review
Made with ❤️ by Ultralytics Actions
Overall the change is small and the new callable API looks correct for padding/truncation with SOT/EOT wrapping. Main issues: avoid `assert` for runtime validation, fix the malformed docstring, and consider aligning per-row tensor creation dtype/device with the preallocated output tensor to prevent extra casts/copies and improve usability.
💬 Posted 3 inline comments
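The dtype alignment this review suggests could look like the following minimal sketch. It is not the PR's actual code: the function name is hypothetical, and `encode`, `sot_token_id`, and `eot_token_id` are assumed to match the attributes the PR describes.

```python
import torch

def encode_batch(tokenizer, texts, context_length=77):
    """Hypothetical sketch of the reviewed pattern; not the PR's exact code."""
    # Preallocate the output once, then create each row's tensor in the same
    # dtype as the output so no extra casts or copies are needed.
    result = torch.zeros(len(texts), context_length, dtype=torch.long)
    for i, text in enumerate(texts):
        # Wrap the encoded text with the cached SOT/EOT ids described in this PR.
        tokens = [tokenizer.sot_token_id] + tokenizer.encode(text) + [tokenizer.eot_token_id]
        if len(tokens) > context_length:
            # Truncate but keep EOT as the final token.
            tokens = tokens[: context_length - 1] + [tokenizer.eot_token_id]
        result[i, : len(tokens)] = torch.tensor(tokens, dtype=result.dtype)
    return result
```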
UltralyticsAssistant left a comment
🔍 PR Review 2
Made with ❤️ by Ultralytics Actions
Clean, focused change overall: caching SOT/EOT IDs and adding a callable API is useful. The main risk is relying on `self.context_length` without guaranteeing it exists, which can cause a runtime `AttributeError`. Also, `assert` isn't ideal for validating public inputs; an explicit exception is safer.
💬 Posted 3 inline comments
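One way to guard against the `AttributeError` risk this review flags is a `getattr` fallback; the class and method names below are hypothetical illustrations, not the PR's code:

```python
class SimpleTokenizerSketch:
    """Hypothetical fragment showing the AttributeError guard this review suggests."""

    DEFAULT_CONTEXT_LENGTH = 77  # matches the default described in the PR summary

    def _resolve_context_length(self, context_length=None):
        # getattr with a fallback means call sites don't raise AttributeError
        # even when self.context_length was never set on this instance.
        if context_length is None:
            context_length = getattr(self, "context_length", self.DEFAULT_CONTEXT_LENGTH)
        return context_length
```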
UltralyticsAssistant left a comment
🔍 PR Review 3
Made with ❤️ by Ultralytics Actions
Overall clean, focused change: caching SOT/EOT IDs and adding a callable API is straightforward and should help downstream usage. The only issue worth addressing is the `context_length` validation: `or` plus an `assert` can mask invalid inputs (and the `assert` may be skipped in optimized runs); switching to explicit `None` handling and raising `ValueError` would make this more robust.
💬 Posted 1 inline comment
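A sketch of the fix this review recommends, with the flagged pattern reconstructed from the review text rather than copied from the diff (so treat both halves as assumptions):

```python
def validate_context_length(context_length, default=77):
    """Hypothetical sketch of the explicit validation this review recommends."""
    # The flagged pattern was roughly:
    #   context_length = context_length or self.context_length
    #   assert context_length > 0
    # `or` silently swaps 0 (falsy but invalid) for the default, and `assert`
    # disappears under `python -O`, so neither reliably rejects bad input.
    if context_length is None:
        context_length = default
    if not isinstance(context_length, int) or context_length <= 0:
        raise ValueError(f"context_length must be a positive int, got {context_length!r}")
    return context_length
```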
Merged — thank you for the awesome improvement, @Laughing-q (and thanks @fcakyon for the contributions)! 🎉 As Leonardo da Vinci famously said, "Simplicity is the ultimate sophistication." This PR embodies that: adding a clean, PyTorch-friendly `__call__` API. Really appreciate the thoughtful design choices and the focus on developer experience—this will make batching and integration into PyTorch pipelines much smoother for everyone.
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Adds a PyTorch-friendly `__call__` API to the CLIP `SimpleTokenizer` for easy text→token tensor conversion 🚀

📊 Key Changes
- Adds `from __future__ import annotations` for cleaner type hints 🧩
- New `torch` dependency in `clip/simple_tokenizer.py` so the tokenizer can return tensors 🔥
- Caches `sot_token_id` (`<|startoftext|>`) and `eot_token_id` (`<|endoftext|>`), plus a default `context_length = 77` 🧠
- New `SimpleTokenizer.__call__(texts, context_length=None) -> torch.LongTensor`: returns a `LongTensor` of shape `[batch, context_length]`, truncating over-length inputs while keeping the `eot_token_id` ✂️ (see the usage sketch after this list)
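A hedged usage sketch based on the summary above; the import path, constructor arguments, and exact semantics are assumptions drawn from this PR's description, not confirmed against the repo:

```python
# Assumed module path and call signature per the PR summary above.
from clip.simple_tokenizer import SimpleTokenizer

tokenizer = SimpleTokenizer()

# Batch of prompts -> LongTensor of shape [batch, context_length], each row
# wrapped with <|startoftext|>/<|endoftext|> and padded to length 77.
tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
print(tokens.shape)  # expected: torch.Size([2, 77])

# Per-call override of the context length (default comes from the instance).
short = tokenizer(["a photo of a cat"], context_length=32)
print(short.shape)  # expected: torch.Size([1, 32])
```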
🎯 Purpose & Impact

- Makes converting raw text into padded, batched token tensors a single call for PyTorch pipelines
- Users without `torch` may need to install it or avoid importing this tokenizer 📦
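For downstream projects that want to keep `torch` optional, a guarded import is one option; this is a sketch, and the module path is an assumption based on the summary above:

```python
# Hypothetical guard for projects where torch should stay optional; the
# import path is assumed from this PR's summary, not confirmed against the repo.
try:
    from clip.simple_tokenizer import SimpleTokenizer  # imports torch internally
except ImportError as err:
    SimpleTokenizer = None
    print(f"SimpleTokenizer unavailable ({err}); install torch to enable it.")
```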