
Non-English tokenizers #464

Open

Description

@yf-hk

Describe the solution you'd like
For CJK languages such as Chinese, words are not separated by spaces, so there is usually a need for a tokenizer that splits sentences into word stems, for example this one: https://github.com/yanyiwu/cppjieba
Is this currently doable in Pisa? If not, are there any plans to add this feature in the future?
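For reference, a minimal sketch of the kind of tokenization I mean, using cppjieba's own API (the dictionary paths below are placeholders for the dict files shipped with cppjieba):

```cpp
#include <iostream>
#include <string>
#include <vector>

#include "cppjieba/Jieba.hpp"

int main() {
    // Placeholder paths: point these at the dict/ directory shipped with cppjieba.
    cppjieba::Jieba jieba("dict/jieba.dict.utf8",
                          "dict/hmm_model.utf8",
                          "dict/user.dict.utf8",
                          "dict/idf.utf8",
                          "dict/stop_words.utf8");

    std::string sentence = "我来到北京清华大学";
    std::vector<std::string> words;

    // Split the sentence into words; the third argument enables the HMM model
    // for out-of-vocabulary words.
    jieba.Cut(sentence, words, true);

    for (const auto& w : words) {
        std::cout << w << "\n";  // e.g. 我 / 来到 / 北京 / 清华大学
    }
    return 0;
}
```

Something like this would need to run at both indexing time and query time so that terms match.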

Metadata

Labels

enhancement (New feature or request) · help wanted (Extra attention is needed) · question (Further information is requested)
