
Simplify TokenizerArgs __post_init__: Unnecessarily verbose #1518

Closed
@Jack-Khuu

Description


🚀 The feature, motivation and pitch

TokenizerArgs.__post_init__ has grown verbose and redundant, and could use some simplification.

from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional, Union

@dataclass
class TokenizerArgs:
    tokenizer_path: Optional[Union[Path, str]] = None
    is_sentencepiece: bool = False
    is_tiktoken: bool = False
    is_hf_tokenizer: bool = False
    t: Optional[Any] = None

    def __post_init__(self):
        try:
            from tokenizer.tiktoken import Tokenizer as TiktokenTokenizer

            self.t = TiktokenTokenizer(model_path=str(self.tokenizer_path))
            self.is_tiktoken = True
            self.is_sentencepiece = False
            self.is_hf_tokenizer = False
            return
        except Exception:
            pass

        try:
            from sentencepiece import SentencePieceProcessor

            self.t = SentencePieceProcessor(model_file=str(self.tokenizer_path))
            self.is_tiktoken = False
            self.is_sentencepiece = True
            self.is_hf_tokenizer = False
            return
        except Exception:
            pass

        try:
            from tokenizer.hf_tokenizer import HFTokenizer

            self.t = HFTokenizer(str(self.tokenizer_path))
            self.is_tiktoken = False
            self.is_sentencepiece = False
            self.is_hf_tokenizer = True
            return
        except Exception:
            pass

        self.is_tiktoken = False
        self.is_sentencepiece = False
        self.is_hf_tokenizer = False
        self.t = None
        return

Task: Simplify the logic in __post_init__ to reduce redundancy
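One possible direction, sketched below rather than prescribed: drive the three backends from a single ordered table so that only the winning flag is ever flipped (the flags already default to False). The `_load_*` helper methods are hypothetical names introduced here for illustration, and the `@dataclass` decorator is assumed from the field defaults and the presence of `__post_init__`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional, Union


@dataclass
class TokenizerArgs:
    tokenizer_path: Optional[Union[Path, str]] = None
    is_sentencepiece: bool = False
    is_tiktoken: bool = False
    is_hf_tokenizer: bool = False
    t: Optional[Any] = None

    # Hypothetical helpers: each one lazily imports its backend and
    # raises (ImportError, OSError, ...) if that backend is unavailable
    # or cannot load tokenizer_path.
    def _load_tiktoken(self) -> Any:
        from tokenizer.tiktoken import Tokenizer as TiktokenTokenizer
        return TiktokenTokenizer(model_path=str(self.tokenizer_path))

    def _load_sentencepiece(self) -> Any:
        from sentencepiece import SentencePieceProcessor
        return SentencePieceProcessor(model_file=str(self.tokenizer_path))

    def _load_hf(self) -> Any:
        from tokenizer.hf_tokenizer import HFTokenizer
        return HFTokenizer(str(self.tokenizer_path))

    def __post_init__(self):
        # Try each backend in priority order; the first one that loads wins.
        # Since all flags default to False, there is no per-branch resetting.
        for flag, loader in (
            ("is_tiktoken", self._load_tiktoken),
            ("is_sentencepiece", self._load_sentencepiece),
            ("is_hf_tokenizer", self._load_hf),
        ):
            try:
                self.t = loader()
            except Exception:
                continue
            setattr(self, flag, True)
            return
        self.t = None
```

This preserves the original behavior (first successful backend wins, all-failure leaves t as None and every flag False) while collapsing the three near-identical try/except blocks into one loop.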

To test, run a model with each tokenizer type:

  • python torchchat.py generate llama2
  • python torchchat.py generate llama3
  • python torchchat.py generate granite-code


Metadata

Labels

  • actionable: Items in the backlog waiting for an appropriate impl/fix
  • good first issue: Good for newcomers
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Status

Done