Currently the options for the ngram and shingle tokenizers/token filters allow the user to set `min_size` and `max_size` to any values. This is dangerous because users can set values which produce a huge number of terms, at best bloating their index and at worst causing problems such as #25841.
I think we should add soft (and/or maybe hard) limits so that neither `min_size` nor `max_size` can be more than, say, 6, and so that the difference between `min_size` and `max_size` can't be more than 2 or 3 (we may even want to make this limit 1).
Note that this does not apply to `edge_ngrams`, where it is useful to have higher values and a larger difference between the min and max values. We should probably decide whether there should be different limits there, though.
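The reason edge ngrams are less dangerous: every gram is anchored at the start of the token, so a token yields at most one gram per size rather than one per position per size. A sketch (again illustrative Python, not Elasticsearch code):

```python
def edge_ngram_count(token_len, min_size, max_size):
    """Number of edge ngrams a token yields: one gram per size in
    min_size..max_size, capped by the token length. Growth is linear
    in the size range, not quadratic."""
    if token_len < min_size:
        return 0
    return min(max_size, token_len) - min_size + 1

# A 20-character token with a wide size range:
print(edge_ngram_count(20, 2, 10))  # 9 terms
```

Compare that with the 135 terms a full ngram tokenizer would emit for the same token and the same 2..10 range, which is why wider ranges are tolerable here.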