
Allow specific characters in token_chars of edge ngram tokenizer in addition to classes #25894

Closed
@edudar

Description


I bumped into this while implementing autocomplete/typeahead functionality with highlighting.

My index settings are:

  analysis:
    tokenizer:
      autocomplete_highlight:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
        token_chars: ["letter", "digit"]
    filter:
      autocomplete_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
    analyzer:
      autocomplete_index:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, autocomplete_ngram]
      autocomplete_search:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, stop]
      autocomplete_highlight:
        type: custom
        tokenizer: autocomplete_highlight
        filter: [standard, icu_normalizer, icu_folding]

I search on the autocomplete field and highlight on autocomplete_highlight. Everything works fine until there is an _ in the search query: icu_tokenizer keeps it as part of the word, while the autocomplete_highlight tokenizer removes it because it only keeps letters and digits, so the two analyses no longer agree. There is no way to keep just _; the only option is the full punctuation class, which brings in a whole load of additional symbols that I don't need and then have to strip out again.
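
To make the mismatch concrete, it can be reproduced through the _analyze API, roughly as sketched below. This assumes the settings above are applied to an index called my_index reachable on localhost:9200 (both names are illustrative) and uses foo_bar as a sample term; the outputs in the comments follow from the behaviour described above, not from a specific run.

  import requests

  ES = "http://localhost:9200/my_index"  # illustrative index created with the settings above

  def tokens(analyzer, text):
      # Ask Elasticsearch to analyze `text` with the named analyzer and return the token strings.
      resp = requests.post(ES + "/_analyze", json={"analyzer": analyzer, "text": text})
      resp.raise_for_status()
      return [t["token"] for t in resp.json()["tokens"]]

  # icu_tokenizer keeps the underscore, so the search analyzer produces a single token:
  print(tokens("autocomplete_search", "foo_bar"))     # e.g. ['foo_bar']
  # the edge ngram tokenizer drops it, so the highlight analyzer never sees the underscore:
  print(tokens("autocomplete_highlight", "foo_bar"))  # e.g. ['f', 'fo', 'foo', 'b', 'ba', 'bar']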

It would be helpful to be able to specify exact characters to keep, such as _, in addition to the character classes.
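
Something along the lines of the sketch below is what I have in mind. The extra parameter name is made up purely to illustrate the idea; it is not an existing Elasticsearch setting.

  # Hypothetical tokenizer definition, only to illustrate the requested knob.
  requested_tokenizer = {
      "type": "edgeNGram",
      "min_gram": 1,
      "max_gram": 15,
      # keep the usual classes...
      "token_chars": ["letter", "digit"],
      # ...plus an explicit list of extra characters to treat as token characters (hypothetical)
      "additional_token_chars": "_",
  }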

At the moment I've implemented a char_filter that replaces _ with -, but that's suboptimal: _ is considered part of a word (just as it is by icu_tokenizer) and is expected to match rather than being ignored.
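
Roughly, that workaround looks like the sketch below: a mapping char_filter that rewrites _ to - before tokenization, so that icu_tokenizer splits words at the same positions as the highlight tokenizer already does. Where exactly the filter is attached is not spelled out above; here it goes on the search analyzer, and the index analyzer would need the same treatment. Host and index name are illustrative.

  import requests

  workaround = {
      "settings": {
          "analysis": {
              "char_filter": {
                  # rewrite underscores before tokenization so icu_tokenizer splits on them
                  "underscore_to_dash": {"type": "mapping", "mappings": ["_ => -"]}
              },
              "analyzer": {
                  "autocomplete_search": {
                      "type": "custom",
                      "char_filter": ["underscore_to_dash"],
                      "tokenizer": "icu_tokenizer",
                      "filter": ["standard", "icu_normalizer", "icu_folding", "stop"],
                  }
              },
          }
      }
  }

  # Create a scratch index with the modified analyzer (requires the analysis-icu plugin,
  # as in the original settings).
  requests.put("http://localhost:9200/my_index_workaround", json=workaround).raise_for_status()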
