
Allow specific characters in token_chars of edge ngram tokenizer in addition to classes #25894

Closed
@edudar

Description


I bumped into this while implementing autocomplete/typeahead functionality with highlighting.

My index settings are:

  analysis:
    tokenizer:
      autocomplete_highlight:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
        token_chars: ["letter", "digit"]
    filter:
      autocomplete_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
    analyzer:
      autocomplete_index:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, autocomplete_ngram]
      autocomplete_search:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, stop]
      autocomplete_highlight:
        type: custom
        tokenizer: autocomplete_highlight
        filter: [standard, icu_normalizer, icu_folding]

I search on the autocomplete field and highlight on autocomplete_highlight. Everything works fine until there is an _ in the search query: icu_tokenizer keeps it as part of the word, while the autocomplete_highlight tokenizer removes it because it only keeps letters and digits, so the two analyses no longer agree. There is no way to keep just _; the only option is the full punctuation class, which brings in a whole load of additional symbols that I don't need and then have to strip out again.
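
To make the mismatch concrete, it can be reproduced through the _analyze API, roughly as sketched below. This assumes the settings above are applied to an index called my_index reachable on localhost:9200 (both names are illustrative) and uses foo_bar as a sample term; the outputs in the comments follow from the behaviour described above, not from a specific run.

  import requests

  ES = "http://localhost:9200/my_index"  # illustrative index created with the settings above

  def tokens(analyzer, text):
      # Ask Elasticsearch to analyze `text` with the named analyzer and return the token strings.
      resp = requests.post(ES + "/_analyze", json={"analyzer": analyzer, "text": text})
      resp.raise_for_status()
      return [t["token"] for t in resp.json()["tokens"]]

  # icu_tokenizer keeps the underscore, so the search analyzer produces a single token:
  print(tokens("autocomplete_search", "foo_bar"))     # e.g. ['foo_bar']
  # the edge ngram tokenizer drops it, so the highlight analyzer never sees the underscore:
  print(tokens("autocomplete_highlight", "foo_bar"))  # e.g. ['f', 'fo', 'foo', 'b', 'ba', 'bar']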

It would be helpful to be able to specify exact characters to keep, such as _, in addition to the character classes.
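
Something along the lines of the sketch below is what I have in mind. The extra parameter name is made up purely to illustrate the idea; it is not an existing Elasticsearch setting.

  # Hypothetical tokenizer definition, only to illustrate the requested knob.
  requested_tokenizer = {
      "type": "edgeNGram",
      "min_gram": 1,
      "max_gram": 15,
      # keep the usual classes...
      "token_chars": ["letter", "digit"],
      # ...plus an explicit list of extra characters to treat as token characters (hypothetical)
      "additional_token_chars": "_",
  }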

At the moment I've implemented a char_filter that replaces _ with -, but that's suboptimal: _ is considered part of a word (just as it is by icu_tokenizer) and is expected to match rather than being ignored.
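
Roughly, that workaround looks like the sketch below: a mapping char_filter that rewrites _ to - before tokenization, so that icu_tokenizer splits words at the same positions as the highlight tokenizer already does. Where exactly the filter is attached is not spelled out above; here it goes on the search analyzer, and the index analyzer would need the same treatment. Host and index name are illustrative.

  import requests

  workaround = {
      "settings": {
          "analysis": {
              "char_filter": {
                  # rewrite underscores before tokenization so icu_tokenizer splits on them
                  "underscore_to_dash": {"type": "mapping", "mappings": ["_ => -"]}
              },
              "analyzer": {
                  "autocomplete_search": {
                      "type": "custom",
                      "char_filter": ["underscore_to_dash"],
                      "tokenizer": "icu_tokenizer",
                      "filter": ["standard", "icu_normalizer", "icu_folding", "stop"],
                  }
              },
          }
      }
  }

  # Create a scratch index with the modified analyzer (requires the analysis-icu plugin,
  # as in the original settings).
  requests.put("http://localhost:9200/my_index_workaround", json=workaround).raise_for_status()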
