Matches unexpectedly fail when there are soft hyphens (U+00AD) #393

@navid-zamani

As you probably know, in German, there are a lot of very long compound words.
So websites (e.g. newspapers) like to use automatic hyphenation, which inserts soft hyphen characters into these words so that lines can break gracefully.
Normally, soft hyphens are completely invisible, unless the text is pasted into a program that cannot handle them (for example a Linux terminal, where they turn into spaces).

This causes patterns that users expect to work to fail, and non-experts have practically no way to find out why.

One example: An­fän­ge­rin­nen does not match Anfängerinnen. The first looks like An-fän-ge-rin-nen to searches, but where each - is a U+00AD.
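To make the invisible characters visible, here is a minimal Python sketch of the mismatch. It is independent of this project's code; the names `SHY`, `hyphenated` and `plain` are made up for illustration, and the soft hyphen is written as the escape `\u00ad` so it cannot get lost when copying.

```python
import re

SHY = "\u00ad"  # soft hyphen, written as an escape so it is visible here

# The word as a hyphenating website would deliver it, and as a user would type it.
hyphenated = SHY.join(["An", "fän", "ge", "rin", "nen"])
plain = "Anfängerinnen"

same_text = hyphenated == plain               # False: four extra code points
literal_match = re.search(plain, hyphenated)  # None: the literal pattern never lines up
extra = len(hyphenated) - len(plain)          # number of hidden U+00AD characters
print(same_text, literal_match, extra)
```

Both strings render identically in most browsers and editors, which is why the failing match looks like a bug in the regex.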

This can be circumvented with regexes, of course, but the invisibility of those hyphens makes it cumbersome and write-only: An­?fän­?ge­?rin­?nen (Each question mark has a U+00AD in front of it) matches, but of course still fails if there are soft hyphens at unexpected places.
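For plain words (not arbitrary regexes), the tolerant pattern can be generated instead of typed by hand, which also covers soft hyphens at unexpected places. The following is only a sketch under that assumption; `shy_tolerant` is a hypothetical helper name, not something this project provides.

```python
import re

SHY = "\u00ad"  # soft hyphen, written as an escape so it is visible here

def shy_tolerant(literal: str) -> re.Pattern:
    """Compile a pattern that matches `literal` even if a soft hyphen
    sits between any two of its characters."""
    # Escape each character so regex metacharacters in the word stay literal,
    # and drop soft hyphens that were pasted into the search term itself.
    parts = [re.escape(ch) for ch in literal if ch != SHY]
    # Allow one optional U+00AD between neighbouring characters.
    return re.compile(f"{SHY}?".join(parts))

pattern = shy_tolerant("Anfängerinnen")
found = pattern.search("An\u00adfän\u00adge\u00adrin\u00adnen")
```

This does not help with user-written regexes containing groups, classes or quantifiers; for those, the pattern would have to be rewritten at the parse-tree level, or the text normalized before matching.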

It would be much nicer if this were taken care of automatically. Or, given that this will not be as common in other languages, maybe it could be a setting to enable?

Just removing all U+00AD before running any regexes is a quick and easy workaround that is definitely acceptable, in case you don’t want to dive deeply into how to do this without modifying the regex parse tree. ;)
It is also what I resorted to.
But of course it breaks the graceful line breaks and leaves large gaps at the end of lines in narrower text columns.
Still, that is much better than mysteriously failing matches and the useless bug report about a “broken regex” that I was about to write just before I realized this.
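One possible way to get the benefit of stripping without touching the displayed text is to match against a soft-hyphen-free copy and translate the match offsets back to the original. This is a rough sketch, not a claim about how this project works internally; `search_ignoring_shy` and `index_map` are hypothetical names.

```python
import re

SHY = "\u00ad"  # soft hyphen, written as an escape so it is visible here

def search_ignoring_shy(pattern, text):
    """Search a soft-hyphen-free copy of `text` and return (start, end)
    offsets into the ORIGINAL text, or None if nothing matches."""
    # index_map[i] is the position in `text` of the i-th character that was kept.
    index_map = [i for i, ch in enumerate(text) if ch != SHY]
    stripped = "".join(text[i] for i in index_map)
    m = re.search(pattern, stripped)
    if m is None:
        return None
    if m.start() == m.end():
        # Empty match: map the single position, which may be the end of the text.
        pos = index_map[m.start()] if m.start() < len(index_map) else len(text)
        return (pos, pos)
    # Map the first and last matched characters back to the original text.
    return (index_map[m.start()], index_map[m.end() - 1] + 1)

text = "Die An\u00adfän\u00adge\u00adrin\u00adnen kommen."
span = search_ignoring_shy("Anfängerinnen", text)
```

The original string is never modified, so the browser can still break lines at the soft hyphens, while the returned span covers the whole word including the hidden characters inside it.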
