Multilanguage tokenizer with language detection #3055

@fmassot

Description

We want to support search on documents in different languages: common Latin-script languages (Eng, Fra, Deu, ...), Asian languages (Jpn, Cmn, ...), and so on.

To reach this goal, we need the following:

  • a fast language detection algorithm, as we don't want the detection phase to limit indexing throughput; see the whichlang repository.
  • a specific tokenizer per language: for Latin-script languages, we could keep the current default tokenizer and use dedicated tokenizers for languages that are not Latin-based (Chinese, Japanese, ...). jieba is available for Chinese and lindera for Japanese (a detection-based dispatch sketch follows this list).
  • either one text field per language or a single text field for all languages to store the tokens in the inverted index. A single text field for all languages may be a good first step, as managing several text fields adds extra complexity.
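A minimal sketch of the detection-based dispatch, assuming the whichlang crate's detect_language function and Lang enum; the tokenizer names ("jieba", "lindera", "default") are only placeholders matching the config example below, not an existing API:

// Cargo.toml (assumed): whichlang = "0.1"
use whichlang::{detect_language, Lang};

// Pick a tokenizer name from the detected language of the text.
fn tokenizer_for(text: &str) -> &'static str {
    match detect_language(text) {
        Lang::Cmn => "jieba",   // Mandarin Chinese -> jieba-based tokenizer
        Lang::Jpn => "lindera", // Japanese -> lindera-based tokenizer
        _ => "default",         // Latin-script and other languages -> default tokenizer
    }
}

fn main() {
    for text in ["the quick brown fox", "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です"] {
        println!("{} -> {}", text, tokenizer_for(text));
    }
}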

Last but not least, before jumping into the code, we should specify how to declare this multilanguage tokenizer in the index config.
For example, a user should be able to define their own custom tokenizer like this:

# index_config.yaml
tokenizers:
  multilanguage_jpn_cmn:
    default: default  # tokenizer used when no language-specific entry matches
    cmn: jieba        # tokenizer used if cmn is detected
    jpn: lindera      # tokenizer used if jpn is detected
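A rough sketch of how such a config entry could be resolved at indexing time; the struct and field names below are illustrative only, not an existing quickwit API:

use std::collections::HashMap;

// Per-tokenizer config: a default tokenizer name plus per-language overrides.
struct MultiLanguageTokenizerConfig {
    default: String,                    // tokenizer used when no override matches
    overrides: HashMap<String, String>, // ISO 639-3 code -> tokenizer name
}

impl MultiLanguageTokenizerConfig {
    // Map a detected language code to the tokenizer name to use.
    fn resolve(&self, detected_lang: &str) -> &str {
        self.overrides
            .get(detected_lang)
            .map(String::as_str)
            .unwrap_or(self.default.as_str())
    }
}

fn main() {
    // Mirrors the multilanguage_jpn_cmn example above.
    let config = MultiLanguageTokenizerConfig {
        default: "default".to_string(),
        overrides: HashMap::from([
            ("cmn".to_string(), "jieba".to_string()),
            ("jpn".to_string(), "lindera".to_string()),
        ]),
    };
    assert_eq!(config.resolve("jpn"), "lindera");
    assert_eq!(config.resolve("eng"), "default");
}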

Metadata
Labels

enhancement (New feature or request)
