Description
As far as I can tell, POSIX awk is required to respect the current locale, but goawk doesn't do that. Instead, it behaves in byte mode by default, unless `-c` is specified, in which case it behaves in UTF-8 codepoints mode.
And while goawk probably can't do arbitrary locales, and ignoring bugs, it seems to have support for LC_CTYPE of either UTF-8 or plain bytes (ASCII/C ?).
So assuming it's desirable for goawk to try and respect the current locale where possible, I think it could look like this:
- Add argument support for `-b` for binary/bytes mode. gawk has a similar `-b` as an alias for `--characters-as-bytes`.
- During argument parsing, if `-b` or `-c` is specified (or `-u`, if that replaces `-c`), then use the specified mode.
- Else try to deduce it from the environment, like so:
  - If any of `LC_ALL`, `LC_CTYPE`, `LANG`, in this override order, is defined - even if empty (in Go: `os.LookupEnv(name)`) - stop the search and use its value:
    - If its lowercased value includes `.utf8` or `.utf-8`, enable UTF-8 mode; else enable byte/binary mode.
- Else (no `-b`/`-c` and none of these vars is defined), pick some default, maybe depending on the platform (e.g. on Windows, and maybe also elsewhere, probably enable UTF-8 because that's what most text files are likely to be).
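The steps above could be sketched in Go roughly as follows. This is only an illustration of the proposed logic, not goawk's actual code: `Mode`, `lookupFunc` and `detectMode` are hypothetical names, the third variable is assumed to be the POSIX `LANG`, and letting `-b` win when both flags are given is an arbitrary choice.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Mode is a hypothetical type for this sketch; it is not part of goawk.
type Mode int

const (
	ModeBytes Mode = iota
	ModeUTF8
)

func (m Mode) String() string {
	if m == ModeUTF8 {
		return "utf8"
	}
	return "bytes"
}

// lookupFunc has the same signature as os.LookupEnv, so a fake
// environment can be injected when testing.
type lookupFunc func(name string) (string, bool)

// detectMode implements the proposed order: explicit flags first, then the
// first *defined* locale variable (even if empty), then a fallback default.
func detectMode(flagBytes, flagChars bool, lookup lookupFunc, fallback Mode) Mode {
	switch {
	case flagBytes: // assumption: -b wins if both flags are given
		return ModeBytes
	case flagChars:
		return ModeUTF8
	}
	for _, name := range []string{"LC_ALL", "LC_CTYPE", "LANG"} {
		value, defined := lookup(name)
		if !defined {
			continue
		}
		// The first defined variable decides, regardless of its value.
		lower := strings.ToLower(value)
		if strings.Contains(lower, ".utf8") || strings.Contains(lower, ".utf-8") {
			return ModeUTF8
		}
		return ModeBytes
	}
	return fallback
}

func main() {
	// The fallback here (UTF-8) is just one possible platform default.
	mode := detectMode(false, false, os.LookupEnv, ModeUTF8)
	fmt.Println("detected mode:", mode)
}
```

Passing the lookup function as a parameter keeps the decision logic separate from the real environment, which makes the override order easy to check in isolation.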
Does something like this make sense? I think it should be fairly trivial to implement, so the real question is whether such behavior is desirable, right?
Is there some assumption or empirical observation that awk scripts tend to behave better in goawk in one mode or the other?
Is there a meaningful performance impact depending on the Unicode mode? I think bytes mode is typically faster in general, but considering that it might be hard for goawk to do regexp matching in bytes mode, does it still matter for goawk?