Description
I bumped into this while implementing autocomplete/typeahead functionality with highlighting.
My index settings are:
analysis:
  tokenizer:
    autocomplete_highlight:
      type: edgeNGram
      min_gram: 1
      max_gram: 15
      token_chars: ["letter", "digit"]
  filter:
    autocomplete_ngram:
      type: edgeNGram
      min_gram: 1
      max_gram: 15
  analyzer:
    autocomplete_index:
      type: custom
      tokenizer: icu_tokenizer
      filter: [standard, icu_normalizer, icu_folding, autocomplete_ngram]
    autocomplete_search:
      type: custom
      tokenizer: icu_tokenizer
      filter: [standard, icu_normalizer, icu_folding, stop]
    autocomplete_highlight:
      type: custom
      tokenizer: autocomplete_highlight
      filter: [standard, icu_normalizer, icu_folding]
I search on the autocomplete field and highlight on autocomplete_highlight. Everything works fine until a search query contains _. icu_tokenizer keeps it, while the autocomplete_highlight tokenizer drops it because it keeps letters and digits only. I can't keep just _ here; I would have to allow the whole punctuation class instead, which brings in a load of additional symbols that I don't need and would then have to strip out again.
It would be helpful to be able to specify the exact characters to keep, such as _.
At the moment I've implemented a char_filter that replaces _ with -, but that's suboptimal: _ is considered part of a word (as it is in icu_tokenizer) and is expected to match rather than be ignored.
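For reference, a minimal sketch of what that workaround might look like, using Elasticsearch's mapping char filter. The filter name underscore_to_hyphen is hypothetical, and which analyzers it is attached to is an assumption: it is shown here on the two icu_tokenizer-based analyzers, so that they split at the same place the letter/digit tokenizer does.

```yaml
analysis:
  char_filter:
    # Hypothetical name; rewrites "_" to "-" before tokenization.
    underscore_to_hyphen:
      type: mapping
      mappings: ["_ => -"]
  analyzer:
    autocomplete_index:
      type: custom
      # Assumption: attached here so icu_tokenizer breaks where "_" used to be.
      char_filter: [underscore_to_hyphen]
      tokenizer: icu_tokenizer
      filter: [standard, icu_normalizer, icu_folding, autocomplete_ngram]
    autocomplete_search:
      type: custom
      char_filter: [underscore_to_hyphen]
      tokenizer: icu_tokenizer
      filter: [standard, icu_normalizer, icu_folding, stop]
```

With this in place a value like foo_bar is analyzed as two words instead of one, which is exactly the drawback described above: the _ is no longer part of the token and cannot be matched.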