
Allow all token and character filters to be used in normalizers #43758

Open
@romseygeek

Description


Elasticsearch makes a distinction between Analyzers, used for breaking up text into individual tokens and applying some normalization to them, and Normalizers, which do no segmentation of text and apply only the normalization stages, for use in keyword fields. We have an additional restriction: only character filters and token filters whose factories are defined as NormalizingXFactory are permitted in the definition of normalizers, and in the past there were checks that these normalizing factories tracked Lucene's MultiTermAwareComponent.
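
For reference, a custom normalizer is currently defined along these lines (a minimal sketch; the index, normalizer, and field names are illustrative). Filters such as lowercase and asciifolding are accepted because their factories are marked as normalizing, while something like a stemming filter is rejected:

```
PUT /my-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code": { "type": "keyword", "normalizer": "my_normalizer" }
    }
  }
}
```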

However, there is some confusion here between the normalization of whole tokens, as done for a keyword field, and the character-by-character normalization done by MultiTermAwareComponent (now replaced in Lucene by normalize() methods on TokenFilterFactory and CharFilterFactory). The latter is used by custom analyzers when Analyzer#normalize() is called, and is specifically designed for use with partial terms such as prefixes or wildcards, where filters such as synonyms or stemmers make no sense; it was never intended as an additional restriction on what can be done to a full term in a keyword field.
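
To illustrate the Lucene side, here is a minimal sketch of that normalize() path, assuming Lucene 8+ where the normalize() hooks have replaced MultiTermAwareComponent; the class and field names are arbitrary:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.util.BytesRef;

public class NormalizeSketch {
    public static void main(String[] args) throws Exception {
        // An analyzer that tokenizes on whitespace and lowercases tokens.
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("whitespace")
                .addTokenFilter("lowercase")
                .build();

        // Analyzer#normalize applies only the filters that implement the
        // normalize() hook (formerly MultiTermAwareComponent) and never
        // runs the tokenizer. It is meant for partial terms, e.g.
        // lowercasing the "Wild" in a "Wild*" wildcard query.
        BytesRef normalized = analyzer.normalize("field", "Wild");
        System.out.println(normalized.utf8ToString()); // -> "wild"
    }
}
```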

To clear up this confusion, we should remove the filter restrictions on normalizers, and instead define them simply as a normal analyzer with either a Keyword or Whitespace tokenizer.
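
Under this proposal, the normalizer sketched above would be equivalent to a custom analyzer along these lines (again illustrative names), with the keyword tokenizer guaranteeing that the input is emitted as a single, unsegmented token:

```
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_normalizer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
```

Any registered character or token filter would then be legal in this position, since the single-token guarantee comes from the tokenizer choice rather than from restricting the filter set.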
