Description
opened on Feb 1, 2024
This issue tracks investigating and addressing the feedback we received on the tokenizers design.
@stephentoub Feedback
If we’re able to make such breaking changes, we should also be reconsidering other aspects of the library then I think, in particular for perf, for example:
- Token is a class, which means an allocation per token, plus the design effectively forces the string Value of the Token to be materialized even if it’s never used.
- I don’t see any way to get just a token count without materializing the list of tokens, even though just the count is a commonly needed thing in these scenarios. Presumably such an API could get away with a lot less overhead / allocation. (Address the feedback on the tokenizer's library #7024)
- Should there be support for spans baked in? (Add Span support in tokenizer's Model abstraction #7035)
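
To make the three points above concrete, here is a minimal C# sketch of the shape such an API could take. It is not the library's actual design: `TokenSketch`, `WhitespaceTokenizerSketch`, `CountTokens`, `Encode`, and `GetValue` are hypothetical names, and a toy whitespace splitter stands in for a real tokenizer model. It only illustrates a struct token that stores offsets instead of a string, a count-only path, and span-based input.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical value-type token: an id plus an offset/length into the source
// text. No string is stored, so no per-token heap allocation is needed.
public readonly struct TokenSketch
{
    public TokenSketch(int id, int offset, int length)
    {
        Id = id;
        Offset = offset;
        Length = length;
    }

    public int Id { get; }
    public int Offset { get; }
    public int Length { get; }

    // The text is sliced from the original input only when a caller asks for it.
    public ReadOnlySpan<char> GetValue(ReadOnlySpan<char> source) => source.Slice(Offset, Length);
}

// Toy whitespace tokenizer standing in for a real model (assumption for illustration).
public static class WhitespaceTokenizerSketch
{
    // Counts tokens without building a list or materializing any strings.
    public static int CountTokens(ReadOnlySpan<char> text)
    {
        int count = 0;
        bool inToken = false;
        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c))
            {
                inToken = false;
            }
            else if (!inToken)
            {
                inToken = true;
                count++;
            }
        }
        return count;
    }

    // Span-based encoding: appends struct tokens to a caller-provided list.
    // The id is just a running index here; a real model would look it up in a vocabulary.
    public static void Encode(ReadOnlySpan<char> text, List<TokenSketch> destination)
    {
        int start = -1;
        for (int i = 0; i <= text.Length; i++)
        {
            bool boundary = i == text.Length || char.IsWhiteSpace(text[i]);
            if (!boundary && start < 0)
            {
                start = i;
            }
            else if (boundary && start >= 0)
            {
                destination.Add(new TokenSketch(destination.Count, start, i - start));
                start = -1;
            }
        }
    }
}
```

With this shape, a caller that only needs a budget check can call `WhitespaceTokenizerSketch.CountTokens(text)` and never pay for a token list, while a caller that needs the text of one token can slice it on demand with `GetValue`.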