Closed
Description
Elasticsearch Version: 7.6.1
It appears that the only supported setting for the CharGroupTokenizer is `tokenize_on_chars`. This is fine for most users as long as the resulting tokens (after the split) are less than 256 characters long. Longer tokens are silently truncated.
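For reference, a minimal `char_group` tokenizer definition looks like this (the index name, tokenizer name, and split characters are illustrative):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": ["whitespace", "-"]
        }
      }
    }
  }
}
```

Any token produced by this tokenizer that exceeds 255 characters is cut off at that length, with no setting available to change it.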
This behaviour is caused by a default setting in org.apache.lucene.analysis.util.CharTokenizer:
public static final int DEFAULT_MAX_WORD_LEN = 255;
However, Lucene allows this default to be overridden, and Elasticsearch should expose that option here as well.
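A minimal sketch of what exposing this could look like, assuming a hypothetical `max_token_length` setting on the `char_group` tokenizer (mirroring the setting of the same name that the `standard` tokenizer already supports):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": ["whitespace"],
          "max_token_length": 512
        }
      }
    }
  }
}
```

This is a proposal, not the current API: as of 7.6.1 the tokenizer rejects unknown settings, so the value would need to be plumbed through to the Lucene CharTokenizer constructor that accepts a maxTokenLen argument.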