bug: Why isn’t space preprocessing consistent between Longest Matching and Multi-Cut? #1061

@Muaykillz

Description

Hi,

I noticed that Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not. Why not preprocess spaces the same way for both tokenizers to ensure consistency?

Thanks for your clarification!

Expected results

A clear explanation of why the two tokenizers handle spaces differently.

Current results

Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not.
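
To make the difference concrete, here is a minimal pure-Python sketch of the two behaviors described above. It is not PyThaiNLP's code: the helper names `split_grouping_spaces` and `split_each_space` are hypothetical, and the sketch only illustrates what "grouping consecutive spaces into one token" means compared with leaving each space as its own token.

```python
import re

# Hypothetical helpers for illustration only; these are not PyThaiNLP functions.
# The capture group makes re.split keep the space runs in its result.
_SPACE_RUN = re.compile(r"( +)")


def split_grouping_spaces(text: str) -> list[str]:
    """Split text so that each run of consecutive spaces is a single chunk."""
    return [chunk for chunk in _SPACE_RUN.split(text) if chunk]


def split_each_space(text: str) -> list[str]:
    """Split text so that every space character is its own chunk."""
    chunks: list[str] = []
    for part in _SPACE_RUN.split(text):
        if not part:
            continue
        if part.startswith(" "):
            # A space run: emit one chunk per space character.
            chunks.extend(part)
        else:
            chunks.append(part)
    return chunks


sample = "สวัสดี   ครับ"  # three spaces between the two words
print(split_grouping_spaces(sample))  # ['สวัสดี', '   ', 'ครับ']
print(split_each_space(sample))       # ['สวัสดี', ' ', ' ', ' ', 'ครับ']
```

If both tokenizers applied the same space handling before dictionary matching, the same input would yield the same whitespace tokens regardless of the engine, which is the consistency this issue asks about.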

Steps to reproduce

PyThaiNLP version

5.0.5

Python version

3.9.6

Operating system and version

Google Colab Latest

More info

No response

Possible solution

No response

Files

No response
