Description
Elasticsearch makes a distinction between Analyzers, which break text up into individual tokens and apply some normalization to each of them, and Normalizers, which do no segmentation of text and apply only the normalization stages, for use in keyword fields. We add a further restriction: only character filters and token filters defined as NormalizingXFactory are permitted in the definition of a normalizer, and in the past there were checks that these normalizing factories tracked Lucene's MultiTermAwareComponent.
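
To illustrate the distinction, here is a minimal sketch using Lucene's CustomAnalyzer (the factory names are Lucene SPI names; the field name `f` and sample text are incidental): the analyzer segments and then normalizes each token, while the normalizer-style chain treats the whole value as a single token.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerVsNormalizer {
    static void printTokens(Analyzer a, String text) throws Exception {
        try (TokenStream ts = a.tokenStream("f", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term);
            }
            ts.end();
        }
    }

    public static void main(String[] args) throws Exception {
        // Analyzer: segments the text, then normalizes each token.
        Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("lowercase")
            .build();
        printTokens(analyzer, "Quick Brown Fox");   // quick / brown / fox

        // Normalizer-style chain: no segmentation, one normalized token.
        Analyzer normalizer = CustomAnalyzer.builder()
            .withTokenizer("keyword")
            .addTokenFilter("lowercase")
            .build();
        printTokens(normalizer, "Quick Brown Fox"); // quick brown fox
    }
}
```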
However, this conflates two different things: normalization of whole tokens, as done for a keyword field, and the character-by-character normalization done by MultiTermAwareComponent (since replaced in Lucene by normalize() methods on TokenFilterFactory and CharFilterFactory). The latter is used by custom analyzers when Analyzer#normalize()
is called, and is specifically designed for partial terms such as prefixes or wildcards, where filters such as synonyms or stemmers make no sense; it was never intended as an additional restriction on what can be done to a full term in a keyword field.
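
As a sketch of that character-level path (again using Lucene's CustomAnalyzer and its SPI factory names; the field name and input fragment are arbitrary), Analyzer#normalize runs only the filters that implement the normalize() hook, so a stemmer in the chain is skipped for a partial term:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.util.BytesRef;

public class NormalizePartialTerm {
    public static void main(String[] args) throws Exception {
        // A full analysis chain: tokenization, lowercasing, stemming.
        Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("lowercase")
            .addTokenFilter("porterStem")
            .build();

        // Analyzer#normalize applies only the filters that implement the
        // character-level normalize() hook (here, lowercase); the stemmer
        // is skipped, which is the right behavior for a prefix or wildcard
        // fragment that is not a complete word.
        BytesRef term = analyzer.normalize("f", "Runni");
        System.out.println(term.utf8ToString()); // "runni", not a stemmed form
    }
}
```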
To clear up this confusion, we should remove the filter restrictions on normalizers and instead define a normalizer simply as a normal analyzer with either a Keyword or a Whitespace tokenizer.
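
For example (a sketch only, again via Lucene's CustomAnalyzer; the stemmer is a hypothetical choice to show a filter the current restriction rejects), a normalizer under this definition could legitimately apply a stemmer to a full keyword value:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ProposedNormalizer {
    public static void main(String[] args) throws Exception {
        // Under the proposal, a normalizer is just an analyzer whose tokenizer
        // emits the whole input as one token; any token filter, including ones
        // the current NormalizingXFactory restriction rejects, may follow.
        Analyzer normalizer = CustomAnalyzer.builder()
            .withTokenizer("keyword")
            .addTokenFilter("lowercase")
            .addTokenFilter("porterStem") // not allowed in normalizers today
            .build();

        try (TokenStream ts = normalizer.tokenStream("f", "Running")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term); // "run": lowercased, then stemmed
            }
            ts.end();
        }
    }
}
```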