[suggestion] revise unicode/binary mode decision #274

@avih

As far as I can tell, POSIX awk is required to respect the current locale, but goawk doesn't do that. Instead, it operates in byte mode by default unless -c is specified, in which case it operates in UTF-8 codepoint mode.
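
To make the difference between the two modes concrete, here is a minimal Go sketch (not goawk's actual code; the `lengths` helper is a hypothetical name) showing how the same string measures differently when counted as bytes versus as UTF-8 codepoints, which is roughly the distinction a function like length() would see:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lengths is a hypothetical helper: it returns the size of s in bytes
// and in UTF-8 codepoints.
func lengths(s string) (nbytes, nrunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := lengths("héllo")
	// "é" is encoded as two bytes in UTF-8, so the two counts differ.
	fmt.Println(b, r) // prints "6 5"
}
```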

goawk probably can't support arbitrary locales, but, bugs aside, it seems to support an LC_CTYPE of either UTF-8 or plain bytes (ASCII/C?).

So assuming it's desirable for goawk to try and respect the current locale where possible, I think it could look like this:

  • Add support for a -b argument for binary/bytes mode. gawk has a similar -b as an alias for --characters-as-bytes.
  • During argument parsing, if -b or -c is specified (or -u, if -c is replaced with it), use the specified mode.
  • Else try to deduce it from the environment, like so:
    • If any of LC_ALL, LC_CTYPE, LANG (in this override order) is defined, even if empty (in Go: os.LookupEnv(name)), stop the search and use its value:
      • if its lowercased value contains .utf8 or .utf-8, enable UTF-8 mode; otherwise enable byte/binary mode.
  • Else (no -b/-c and none of these vars is defined), pick some default, maybe depending on the platform (e.g. on Windows, and maybe also elsewhere, probably enable UTF-8 because that's what most text files are likely to be).
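
The steps above could be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not goawk's implementation: the `Mode` type, the `detectMode` function, and its parameters are hypothetical names, the third variable is taken to be `LANG` (the standard POSIX name), and letting -b win when both flags are given is an arbitrary choice the proposal doesn't cover. The environment lookup is passed in as a function so the logic can be tested without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Mode is a hypothetical type for the character-handling mode.
type Mode int

const (
	ModeBytes Mode = iota
	ModeUTF8
)

func (m Mode) String() string {
	if m == ModeUTF8 {
		return "utf8"
	}
	return "bytes"
}

// detectMode picks a mode from the flags, then the locale variables,
// then the fallback. lookup has the same signature as os.LookupEnv.
func detectMode(flagBytes, flagChars bool, lookup func(string) (string, bool), fallback Mode) Mode {
	// Explicit flags take priority (assumption: -b wins if both are set).
	switch {
	case flagBytes:
		return ModeBytes
	case flagChars:
		return ModeUTF8
	}
	// The first variable that is defined, even if empty, decides the mode.
	for _, name := range []string{"LC_ALL", "LC_CTYPE", "LANG"} {
		if v, ok := lookup(name); ok {
			v = strings.ToLower(v)
			if strings.Contains(v, ".utf8") || strings.Contains(v, ".utf-8") {
				return ModeUTF8
			}
			return ModeBytes
		}
	}
	// No flag and no locale variable defined: use the platform default.
	return fallback
}

func main() {
	fmt.Println(detectMode(false, false, os.LookupEnv, ModeUTF8))
}
```

Note that with this rule a defined but empty LC_ALL stops the search and selects byte mode, even if LANG names a UTF-8 locale.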

Does something like this make sense? I think it should be fairly trivial to implement, so the real question is whether such behavior is desirable, right?

Is there some assumption or empirical observation that awk scripts tend to behave better in goawk in one mode or the other?

Is there a meaningful performance impact depending on the Unicode mode? I think bytes mode is typically faster in general, but considering that it might be hard for goawk to do regexp matching in bytes mode, does that still matter for goawk?
