respect_word_boundaries: true breaks when first character of the search term is non-ASCII #1916

Closed as not planned
@spacekpe

Description

I did:

  • Search to see if my issue has already been submitted
  • Make sure I'm reporting something precise that needs to be fixed
  • Give my issue a descriptive and concise title
  • Create a minimal working example on JsFiddle or Codepen
    (or gave a link to a demo on the Selectize docs)
  • Indicate precise steps to reproduce in numbers and the result,
    like below

An option whose string begins with a non-ASCII (Unicode) character cannot be found by searching for that character.

Steps to reproduce:

  1. Use code from https://jsfiddle.net/w9gecnyo/4/
  2. Search for one of the two Unicode characters: "č" or "Č"

TL;DR Define two options, like "Čápkova" and "Ečerova", and then search for "č" or "Č" with respect_word_boundaries enabled (default).

Expected result:
Only option "Čápkova" should be listed (the match is on the first letter, i.e. at a word boundary).

Actual result:
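Only option "Ečerova" is listed, presumably because a non-ASCII character is not treated as a word character: no word boundary is detected before "Č" at the start of the string, while one is detected between "E" and "č".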

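As far as I can tell, this is caused by the \b that Sifter adds to the search pattern for respect_word_boundaries: true. This looks like a limitation of how \b is defined in JavaScript regular expressions (it only considers the ASCII word characters A-Z, a-z, 0-9 and _), so Unicode-aware word boundary detection needs some other approach.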
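The behaviour can be reproduced in plain JavaScript without Selectize or Sifter. The sketch below assumes the generated pattern has roughly the shape \b + search term with the case-insensitive flag; that shape and the helper name matchesAtAsciiWordBoundary are my assumptions for illustration, not code copied from Sifter.

```javascript
// Reproduces the symptom with a plain RegExp.
// Assumption: the library builds something like /\bTERM/i; this is a
// simplified stand-in, not Sifter's actual pattern.
const options = ["Čápkova", "Ečerova"];

function matchesAtAsciiWordBoundary(term, text) {
  // \b is defined via \w, and \w only covers [A-Za-z0-9_].
  // "Č" is therefore a NON-word character: there is no boundary between
  // the start of the string and "Č", but there is one between "E" and "č".
  const pattern = new RegExp("\\b" + term, "i");
  return pattern.test(text);
}

const asciiHits = options.filter((o) => matchesAtAsciiWordBoundary("č", o));
console.log(asciiHits); // [ 'Ečerova' ]
```

This matches the actual result reported above: the option where "č" follows an ASCII letter is found, and the option that starts with "Č" is not.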

This attempt at regex101.com seems to confirm that:

[screenshot of the regex101.com test]

Stack Overflow seems to somewhat agree with this diagnosis:
https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters
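One possible workaround, sketched here only to illustrate the idea, is to replace \b with a negative lookbehind over Unicode letters and digits. This assumes a runtime that supports the u flag, \p{...} property escapes and lookbehind assertions; the helper names escapeRegExp and matchesAtUnicodeWordStart are hypothetical, and this is not a patch for Sifter.

```javascript
// Sketch of a Unicode-aware "start of word" check.
// Assumption: the runtime supports the `u` flag, \p{L}/\p{N} property
// escapes and lookbehind. Helper names are made up for this example.
function escapeRegExp(s) {
  // Escape regex syntax characters so the term is matched literally.
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function matchesAtUnicodeWordStart(term, text) {
  // (?<![\p{L}\p{N}_]) succeeds only when the previous character is NOT
  // a letter, digit or underscore (or when there is no previous character).
  const pattern = new RegExp(
    "(?<![\\p{L}\\p{N}_])" + escapeRegExp(term),
    "iu"
  );
  return pattern.test(text);
}

const unicodeHits = ["Čápkova", "Ečerova"].filter((o) =>
  matchesAtUnicodeWordStart("č", o)
);
console.log(unicodeHits); // [ 'Čápkova' ]
```

With this check, searching for "č" or "Č" lists only "Čápkova", which is the expected result described above.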
