CharGroupTokenizerFactory should allow users to set the maximum character limit for a word #56676

Closed
@ADBalici

Description

Elasticsearch Version: 7.6.1

It appears that the only supported setting for the CharGroupTokenizer is tokenize_on_chars. This is fine for most users as long as the resulting words (after the split) are shorter than 256 characters. Longer words are truncated.
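The effect can be illustrated with a minimal sketch (plain Python, not the actual Lucene implementation, which is simplified here to truncation at the limit):

```python
def char_group_tokenize(text, split_chars, max_word_len=255):
    """Split text on any character in split_chars, cutting each
    resulting token at max_word_len characters (mirroring the
    effect of Lucene's DEFAULT_MAX_WORD_LEN described above)."""
    tokens = []
    current = []
    for ch in text:
        if ch in split_chars:
            if current:
                tokens.append("".join(current[:max_word_len]))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current[:max_word_len]))
    return tokens

# A 300-character word is silently cut to the default limit:
long_word = "x" * 300
print(len(char_group_tokenize(long_word, {"-"})[0]))  # 255
```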

This behaviour is caused by a default setting in org.apache.lucene.analysis.util.CharTokenizer:

public static final int DEFAULT_MAX_WORD_LEN = 255;

However, Lucene allows this default to be overridden, which is something that should be supported here as well.
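A possible shape for such a setting in the index configuration, by analogy with the standard tokenizer's max_token_length (the parameter name on the char_group tokenizer is an assumption here, not a confirmed API):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_char_group": {
          "type": "char_group",
          "tokenize_on_chars": ["whitespace", "-"],
          "max_token_length": 512
        }
      }
    }
  }
}
```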
