Description
Hi,
I noticed that Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not. Why not preprocess spaces the same way for both tokenizers to ensure consistency?
Thanks for your clarification!
Expected results
A clear explanation
Current results
Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not.
Steps to reproduce
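A minimal sketch for comparing the two engines. It assumes `word_tokenize` from `pythainlp.tokenize` accepts the engine names `"mm"` (Multi-Cut) and `"longest"` (Longest Matching). `compare_engines` and `merge_space_tokens` are hypothetical helpers written for this report, not part of PyThaiNLP. The second one only illustrates the kind of space grouping described above and could be applied to either engine's output.

```python
from typing import Dict, List


def _is_space(token: str) -> bool:
    """Return True for a non-empty token made up only of space characters."""
    return len(token) > 0 and token.strip(" ") == ""


def merge_space_tokens(tokens: List[str]) -> List[str]:
    """Merge each run of consecutive space-only tokens into one token.

    Hypothetical helper (not a PyThaiNLP function): it shows what
    "grouping consecutive spaces into one token" would look like if it
    were applied uniformly to any tokenizer's output.
    """
    merged: List[str] = []
    for token in tokens:
        if _is_space(token) and merged and _is_space(merged[-1]):
            # Extend the previous space token instead of starting a new one.
            merged[-1] += token
        else:
            merged.append(token)
    return merged


def compare_engines(text: str) -> Dict[str, List[str]]:
    """Tokenize the same text with both engines for a side-by-side check.

    Assumption: "mm" selects Multi-Cut and "longest" selects Longest Matching.
    PyThaiNLP is imported lazily so merge_space_tokens works without it.
    """
    from pythainlp.tokenize import word_tokenize

    return {
        engine: word_tokenize(text, engine=engine)
        for engine in ("mm", "longest")
    }


# Example usage (requires PyThaiNLP to be installed):
# for engine, tokens in compare_engines("ฉันรัก   ภาษาไทย").items():
#     print(engine, tokens, merge_space_tokens(tokens))
```

Printing the raw tokens next to the merged tokens for each engine shows whether that engine already groups consecutive spaces. If the two lists match, it does.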
PyThaiNLP version
5.0.5
Python version
3.9.6
Operating system and version
Google Colab Latest
More info
No response
Possible solution
No response
Files
No response