Multilanguage tokenizer with language detection #3055

@fmassot

Description

We want to support search on documents in different languages: common Latin-script languages (Eng, Fra, Deu, ...), Asian languages (Jpn, Cmn, ...), and so on.

To reach this goal, we need the following:

  • a fast language detection algorithm, as we don't want the detection phase to limit indexing throughput; see the whichlang repository.
  • a specific tokenizer per language: for Latin-script languages, we could keep the current default tokenizer and use dedicated tokenizers for languages that are not Latin-based (Chinese, Japanese, ...). jieba is available for Chinese and lindera for Japanese (a detection-based dispatch sketch follows this list).
  • either one text field per language or a single text field for all languages to store the tokens in the inverted index. A single text field for all languages may be a good first step, as managing several text fields adds extra complexity.
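A minimal sketch of the detection-based dispatch, assuming the whichlang crate's detect_language function and Lang enum; the tokenizer names ("jieba", "lindera", "default") are only placeholders matching the config example below, not an existing API:

// Cargo.toml (assumed): whichlang = "0.1"
use whichlang::{detect_language, Lang};

// Pick a tokenizer name from the detected language of the text.
fn tokenizer_for(text: &str) -> &'static str {
    match detect_language(text) {
        Lang::Cmn => "jieba",   // Mandarin Chinese -> jieba-based tokenizer
        Lang::Jpn => "lindera", // Japanese -> lindera-based tokenizer
        _ => "default",         // Latin-script and other languages -> default tokenizer
    }
}

fn main() {
    for text in ["the quick brown fox", "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です"] {
        println!("{} -> {}", text, tokenizer_for(text));
    }
}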

Last but not least, before jumping into the code, we should specify how to declare this multilanguage tokenizer in the index config.
For example, a user should be able to define their own custom tokenizer like this:

# index_config.yaml
tokenizers:
  multilanguage_jpn_cmn:
    default: default  # tokenizer used when no language-specific entry matches
    cmn: jieba        # tokenizer used if cmn is detected
    jpn: lindera      # tokenizer used if jpn is detected
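A rough sketch of how such a config entry could be resolved at indexing time; the struct and field names below are illustrative only, not an existing quickwit API:

use std::collections::HashMap;

// Per-tokenizer config: a default tokenizer name plus per-language overrides.
struct MultiLanguageTokenizerConfig {
    default: String,                    // tokenizer used when no override matches
    overrides: HashMap<String, String>, // ISO 639-3 code -> tokenizer name
}

impl MultiLanguageTokenizerConfig {
    // Map a detected language code to the tokenizer name to use.
    fn resolve(&self, detected_lang: &str) -> &str {
        self.overrides
            .get(detected_lang)
            .map(String::as_str)
            .unwrap_or(self.default.as_str())
    }
}

fn main() {
    // Mirrors the multilanguage_jpn_cmn example above.
    let config = MultiLanguageTokenizerConfig {
        default: "default".to_string(),
        overrides: HashMap::from([
            ("cmn".to_string(), "jieba".to_string()),
            ("jpn".to_string(), "lindera".to_string()),
        ]),
    };
    assert_eq!(config.resolve("jpn"), "lindera");
    assert_eq!(config.resolve("eng"), "default");
}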

Metadata
Labels

enhancement (New feature or request)
