Support case insensitive search on new wildcard field and keyword #53603

Closed
@markharwood

Description

Currently the wildcard field supports only case-sensitive search, but it is vital that we find a way to offer case-insensitive search too. A recent blog post highlighted general string-matching problems and showed how users have resorted to ugly regular expressions like this one to work around case sensitivity:

/[Cc]:\\[Ww][Ii][Nn][Dd][Oo][Ww][Ss]\\[Ss][Yy][Ss][Tt][Ee][Mm]32\\.*/

The example above searches for a path on a case-insensitive operating system, where hackers may deliberately have used mixed-case commands to try to evade simpler detection rules.
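
To make the cost of this workaround concrete, here is a minimal Python sketch that expands a plain literal into the per-character form shown above. The function name `case_insensitive_regex` and the `RESERVED` set are assumptions made for illustration (the set approximates the characters that need escaping in a Lucene-style regular expression); nothing here is an Elasticsearch API.

```python
# Characters assumed to need a backslash escape in a Lucene-style regexp.
# This set is an approximation chosen for the sketch.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def case_insensitive_regex(literal: str) -> str:
    """Expand a literal into a regex that matches it in any letter case."""
    parts = []
    for ch in literal:
        lower, upper = ch.lower(), ch.upper()
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            # Cased letter: emit a two-character class such as [Ww].
            parts.append(f"[{upper}{lower}]")
        elif ch in RESERVED:
            # Reserved character: escape it so it is matched literally.
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "".join(parts)


# Rebuild the expression from the issue: literal prefix plus a trailing ".*".
WINDOWS_PREFIX_REGEX = case_insensitive_regex("C:\\Windows\\System32\\") + ".*"
```

Every letter in the literal turns into a four-character class, so the expression grows quickly and becomes hard to read and review, which is the usability problem this issue is about.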

Solution 1: Index-time case choices

We could make the wildcard field accept an optional normalizer to lower-case the content at index time (much like the keyword field). However, a centralised logging system may store content from both Windows and Unix machines, whose file systems are case-insensitive and case-sensitive respectively, so the importance of case may vary from one document to the next. This would typically force us to index with multi-fields (one case-sensitive, the other not), which would double the storage costs.
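
For reference, the multi-field pattern described above looks roughly like the following when written with today's keyword field, shown here as a Python dict that could be serialised as an index-creation body. The field name `process_path`, the sub-field name `lowercase`, and the normalizer name `lowercase_normalizer` are hypothetical. Whether the wildcard field would accept a `normalizer` in the same way is exactly what this solution proposes, so the sketch uses `keyword` throughout rather than assume it.

```python
import json


def build_index_body() -> dict:
    """Index body with a case-sensitive field and a lower-cased sub-field."""
    return {
        "settings": {
            "analysis": {
                "normalizer": {
                    # Custom normalizer that lower-cases values at index time.
                    "lowercase_normalizer": {
                        "type": "custom",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # Hypothetical field: the original, case-sensitive value.
                "process_path": {
                    "type": "keyword",
                    "fields": {
                        # Second, normalised copy of the same value.
                        # This duplicate is the extra storage cost.
                        "lowercase": {
                            "type": "keyword",
                            "normalizer": "lowercase_normalizer",
                        }
                    },
                }
            }
        },
    }


INDEX_BODY_JSON = json.dumps(build_index_body(), indent=2)
```

A searcher would then pick `process_path` or `process_path.lowercase` depending on whether case matters for the query at hand.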

Solution 2: Query-time choices

The wildcard field already has two representations of the original content: an ngram index for approximate matching and a binary doc value of the original bytes for verifying approximate matches. If the ngram index is changed to always use lower case, then whether matching is case-sensitive becomes a query-time option applied when verifying candidate matches. There would be a (likely small) increase in the number of false positives from the approximate matching, but the big advantage is no increase in today's storage costs (actually a decrease if we normalise ngrams).
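
The two-phase idea can be sketched in plain Python as follows. This is a toy model under stated assumptions, not the wildcard field's actual implementation: the class `WildcardFieldSketch`, the 3-character ngram size, and the `case_insensitive` parameter name are all hypothetical, and a Python list stands in for the binary doc values.

```python
import re
from collections import defaultdict

NGRAM_SIZE = 3  # assumed ngram length for this sketch


def ngrams(text: str, n: int = NGRAM_SIZE) -> set:
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def wildcard_to_regex(pattern: str) -> str:
    """Translate a * / ? wildcard pattern into a Python regex string."""
    return "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c)
        for c in pattern
    )


class WildcardFieldSketch:
    def __init__(self):
        self.doc_values = []           # original values, case preserved
        self.index = defaultdict(set)  # lower-cased ngram -> doc ids

    def add(self, value: str) -> int:
        doc_id = len(self.doc_values)
        self.doc_values.append(value)
        # The ngram index is always lower-cased, whatever the input case.
        for gram in ngrams(value.lower()):
            self.index[gram].add(doc_id)
        return doc_id

    def candidates(self, pattern: str) -> set:
        """Approximate phase: docs containing every ngram of the pattern."""
        result = set(range(len(self.doc_values)))
        for literal in re.split(r"[*?]", pattern.lower()):
            for gram in ngrams(literal):
                result &= self.index.get(gram, set())
        return result

    def search(self, pattern: str, case_insensitive: bool = False) -> list:
        """Verification phase: case handling is decided here, per query."""
        flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
        regex = re.compile(wildcard_to_regex(pattern), flags)
        return sorted(
            doc_id
            for doc_id in self.candidates(pattern)
            if regex.fullmatch(self.doc_values[doc_id])
        )
```

With values such as `C:\Windows\System32\cmd.exe` and `c:\wInDoWs\sYsTeM32\evil.exe` indexed, a case-sensitive search for `C:\Windows\System32\*` still pulls both documents out of `candidates` (the extra false positive mentioned above), but verification keeps only the first; passing `case_insensitive=True` keeps both. No second copy of the data is stored.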

In either solution the searcher has to make a conscious decision: either to search a case-insensitive field or to declare the query clause as case-insensitive.

Solution 2 looks preferable to me from the back end, but it breaks with the existing approach, in which case sensitivity is an index-time mapping decision, not a property of a query clause. This means that the wildcard query clause would have a case-sensitivity parameter that is relevant if you target a wildcard field but not a text or keyword field (although we could amend the keyword field logic to support this too).

Thoughts @jimczi @jpountz ?

Labels

:Search Foundations/Mapping (Index mappings, including merging and defining field types)
Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch)