As you probably know, German has a lot of very long compound words.
So websites (e.g. newspapers) like to use automatic hyphenation, which inserts soft hyphen characters (U+00AD) into those words so that lines can break gracefully.
Normally they are completely invisible, unless the text is pasted into a program that cannot handle them (for example a Linux terminal, where they turn into spaces).
This causes patterns that users expect to work to fail, and non-experts will find it impossible to even figure out why.
One example: Anfängerinnen does not match Anfängerinnen. To a search, the first is really An-fän-ge-rin-nen, where each - stands for a U+00AD (soft hyphen).
This can of course be worked around with regexes, but because those hyphens are invisible, the result is cumbersome and write-only: An?fän?ge?rin?nen (where each question mark is preceded by a U+00AD) matches, but still fails as soon as soft hyphens appear in unexpected places.
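To make the regex workaround less write-only, the pattern can be generated instead of typed by hand. The following is only a minimal sketch in Python, not something the tool does today: `tolerant_pattern` is a hypothetical helper name, and it only handles literal words, not arbitrary regexes. It allows an optional U+00AD between every pair of characters, so unexpected hyphenation points are covered too, and it writes the soft hyphen as a visible `\u00AD` escape.

```python
import re

def tolerant_pattern(word: str) -> re.Pattern:
    """Build a regex for a literal word that tolerates soft hyphens.

    Hypothetical helper for illustration: an optional U+00AD is allowed
    between every pair of characters of `word`.
    """
    # Escape each character so regex metacharacters in the word stay literal,
    # then join them with a visible escape for "optional soft hyphen".
    return re.compile(r"\u00AD?".join(re.escape(ch) for ch in word))

hyphenated = "An\u00adfän\u00adge\u00adrin\u00adnen"  # as copied from a web page

# A plain substring search fails because of the invisible characters ...
assert "Anfängerinnen" not in hyphenated
# ... while the generated pattern matches both variants.
assert tolerant_pattern("Anfängerinnen").fullmatch(hyphenated)
assert tolerant_pattern("Anfängerinnen").fullmatch("Anfängerinnen")
```

This keeps the original text untouched, at the cost of a longer pattern and of not working for user-supplied regexes without parsing them.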
It would be much nicer if this were taken care of automatically. Or, given that this will not be as common in other languages, maybe as a setting that can be enabled?
Just removing all U+00AD before running any regexes is a quick and easy workaround that is definitely acceptable, in case you don’t want to dive deeply into how to do this without modifying the regex parse tree. ;)
It is also what I resorted to.
But of course it breaks the graceful line breaks, and leaves large gaps at the end of lines on narrower text columns.
Still, that is much better than mysteriously failing matches and the useless bug report about a “broken regex” that I was about to write, just before I realized this.
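For reference, the quick workaround described above could look roughly like this in Python. This is a sketch under assumptions, not the project's code: `search_ignoring_soft_hyphens` is a made-up name, and it strips U+00AD only from a copy used for matching, so the displayed text would keep its line-break hints.

```python
import re

SOFT_HYPHEN = "\u00ad"

def search_ignoring_soft_hyphens(pattern: str, text: str):
    """Run re.search on a copy of `text` with all U+00AD removed.

    Hypothetical helper for illustration. Match offsets refer to the
    cleaned copy, not to the original text.
    """
    cleaned = text.replace(SOFT_HYPHEN, "")
    return re.search(pattern, cleaned)

page_text = "Für An\u00adfän\u00adge\u00adrin\u00adnen und Fort\u00adge\u00adschrit\u00adte\u00adne"

# The unmodified text does not match what the user sees on screen ...
assert re.search(r"Anfängerinnen", page_text) is None
# ... but the cleaned copy does, even with an ordinary user-written regex.
match = search_ignoring_soft_hyphens(r"Anfänger(innen)?", page_text)
assert match is not None and match.group(0) == "Anfängerinnen"
```

The catch is that positions in the match no longer line up with the original string, so highlighting or replacing in the original text would need an extra offset mapping.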