Closed
Labels: enhancement
Description
We want to support search on documents in different languages, such as common Latin languages (Eng, Fra, Deu, ...), Asian languages (Jpn, Cmn, ...), ...
To reach this goal, we need the following:
- a fast language detection algorithm, since the detection phase must not limit the indexing throughput. See the whichlang repository.
- specific tokenizers for each language: for Latin languages, we could keep the current default tokenizer and add dedicated tokenizers for languages that are not Latin-based (Chinese, Japanese, ...). There is jieba for Chinese and lindera for Japanese.
- either one text field per language or a single text field for all of them to store the tokens in the inverted index. A single text field for all languages may be a good first step, as managing several text fields adds extra complexity.
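To make the three requirements above concrete, here is a minimal sketch of the detect-then-dispatch flow feeding a single token stream. Everything in it is an assumption for illustration: `detect_language` is a toy Unicode-range heuristic standing in for a real detector such as whichlang, and `char_tokenizer` is a placeholder for jieba or lindera, not their actual behavior.

```python
from typing import Callable, Dict, List


def detect_language(text: str) -> str:
    """Toy stand-in for a real detector: guesses from Unicode blocks only."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana blocks
            return "jpn"
        if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
            has_han = True
    # Han characters without any kana are treated as Chinese here.
    return "cmn" if has_han else "default"


def whitespace_tokenizer(text: str) -> List[str]:
    """Stand-in for the current default tokenizer: lowercase, split on spaces."""
    return text.lower().split()


def char_tokenizer(text: str) -> List[str]:
    """Placeholder for jieba/lindera: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry: language code -> tokenizer function.
TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "default": whitespace_tokenizer,
    "cmn": char_tokenizer,
    "jpn": char_tokenizer,
}


def tokenize(text: str) -> List[str]:
    """Detect the language, then dispatch to the matching tokenizer.

    All tokens end up in the same list, mirroring the single-text-field option.
    """
    lang = detect_language(text)
    tokenizer = TOKENIZERS.get(lang, TOKENIZERS["default"])
    return tokenizer(text)
```

With a single text field, the output of `tokenize` would be indexed as-is regardless of the detected language. A per-language field design would instead need to return the language code alongside the tokens.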
Last but not least, we should specify how to declare this multilanguage tokenizer in the index config before jumping into the code.
For example, a user should be able to define a custom tokenizer like this:
# index_config.yaml
tokenizers:
  multilanguage_jpn_cmn:
    default: default  # default tokenizer used
    cmn: jieba        # tokenizer used if cmn is detected
    jpn: lindera      # tokenizer used if jpn is detected
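As a rough illustration of how such a declaration could be resolved at indexing time, the sketch below looks up the tokenizer name for a detected language and falls back to the `default` entry. The dictionary mirrors the proposed config by hand (no YAML parsing), and `resolve_tokenizer_name` is a hypothetical helper, not an existing API.

```python
from typing import Any, Dict

# Hand-written mirror of the proposed index config (assumed structure).
INDEX_CONFIG: Dict[str, Any] = {
    "tokenizers": {
        "multilanguage_jpn_cmn": {
            "default": "default",
            "cmn": "jieba",
            "jpn": "lindera",
        }
    }
}


def resolve_tokenizer_name(config: Dict[str, Any], tokenizer_id: str, lang: str) -> str:
    """Return the tokenizer configured for `lang`, or the `default` entry."""
    mapping = config["tokenizers"][tokenizer_id]
    return mapping.get(lang, mapping["default"])
```

For instance, `resolve_tokenizer_name(INDEX_CONFIG, "multilanguage_jpn_cmn", "fra")` falls back to `"default"` because no `fra` entry is declared.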