
Relax bare key restrictions to allow additional unicode letters and numbers #687

Closed
@marzer

Issue

TOML's "bare key" syntax is too restrictive. People who regularly use characters from languages other than English should be able to do so in TOML keys without additional gymnastics.

I know there's already been a lot of discussion about this but much of it was from when TOML was less established and I think it warrants revisiting.

Proposed change

Expand the set of characters accepted in bare keys to include letters and numbers from the entire Unicode space, similar to how identifiers are handled in other Unicode-compliant contexts (e.g. Python, JavaScript, etc.). Specifically:

  • Allow codepoints from categories Ll, Lm, Lo, Lt, Lu, Nd and Nl anywhere in a bare key
  • Allow codepoints from categories Mc and Mn anywhere in a bare key except as the first character
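
As a rough illustration of the two rules above, here is a minimal Python sketch built on the standard `unicodedata` module. The function name `is_bare_key` is hypothetical, and keeping `-` and `_` valid is an assumption based on the proposal only expanding what bare keys already accept. Results also depend on the Unicode database version bundled with the Python build.

```python
import unicodedata

# Categories the proposal allows anywhere in a bare key.
ANYWHERE = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nd", "Nl"}
# Categories the proposal allows anywhere except the first character.
NOT_FIRST = {"Mc", "Mn"}
# Assumption: the hyphen and underscore that bare keys accept today stay valid.
EXISTING_EXTRAS = {"-", "_"}


def is_bare_key(key: str) -> bool:
    """Return True if `key` would be a valid bare key under the proposed rules."""
    if not key:
        return False  # an empty bare key is never valid
    for index, ch in enumerate(key):
        if ch in EXISTING_EXTRAS:
            continue
        category = unicodedata.category(ch)
        if category in ANYWHERE:
            continue
        if category in NOT_FIRST and index > 0:
            continue
        return False
    return True
```

Under this sketch, keys such as `größe` or `名前` are accepted, while `a+b` is still rejected because `+` is a math symbol (category Sm).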

Rationale

After reading much of the existing discussion on the issue, I've identified the points below as being the main objections. I've written a counterpoint for each.

"ASCII-only is easy to understand"

Allowing Unicode letters and numbers wouldn't change the understandability of the written word in "mostly-ASCII" contexts, excepting maybe people from English-centric countries encountering characters they otherwise rarely see and being unsure how to pronounce them. I'm one of those people and my brain seems to consume them just fine. And it's almost certainly going to improve the understandability of bare keys to people for whom an ASCII environment is not their regular one.

It also wouldn't change the semantic/syntactic understandability of the language; I'm only advocating relaxing the spec to allow letter and number characters, not anything that might be confused for a language construct (no math symbols, for instance).

"Guides users to choose simple key names"

See above. I'd argue that the keys would be no less simple with this change. I live and work in a European country and a number of my friends and colleagues have non-ASCII letters in their name (e.g. ä). I doubt they consider their names to be complex; I certainly don't. If anything, by forcing people to jump through hoops just to type in their language, we're actually making the key names more complex w.r.t. cognitive load.

"Eliminate any weirdness that could come from having to deal with undelimited Unicode"

The TOML spec dictates UTF-8, not UTF-8-ish. UTF-8 is a solved problem at this point. If a parser doesn't correctly detect and handle malformed UTF-8, I'd argue that the parser needs fixing, not that we should bend over to accommodate users who are using crap tools and libraries. It's such a solved problem that you can even portably consume it using a state machine and validate it using vector intrinsics.
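
To show how little a parser has to do here, this is a small Python sketch that rejects malformed input before parsing. It leans on Python's built-in strict decoder rather than the state machine or vectorised approaches mentioned above, and the function name `is_valid_utf8` is made up for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8."""
    try:
        # Strict mode raises on truncated sequences, overlong encodings
        # and encoded surrogates instead of silently replacing them.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```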

"Keys should be identifier-like"

Despite the fact that the concept of an "identifier" isn't a thing in TOML, I'll concede that in some situations this might be a concern. A reasonable example is using TOML in code generation contexts: if you used TOML keys to inform variable names, you'd historically have run into issues with non-ASCII characters in many languages, though this is no longer true. Even good old C++ supports Unicode characters in identifiers on modern compilers.

...all of which is rendered moot by the fact that TOML supports hyphens in bare keys, which are often invalid in identifier contexts, so this objection is a non-starter anyway.

"It complicates implementation"

It really doesn't. Many implementations will be able to leverage built-in helper functions or libraries for working with Unicode. For those that can't, I've put my money where my mouth is and implemented this as a proof-of-concept in my own TOML parser and I'm happy for my code to be used as a starting point:

Of course you might argue that simply accepting UTF-8 bytes from a TOML implementation is not an option for everyone, and you'd be right; there will always be situations where only ASCII makes sense (e.g. legacy codebases). I'd respond by pointing out that detecting non-ASCII characters in a character stream is laughably trivial. Applications requiring ASCII-only can easily enforce this themselves.
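
For what it's worth, an ASCII-only check of the kind described above can be sketched in a few lines of Python. The helper name `first_non_ascii` is hypothetical; an application would run it on the raw document bytes before handing them to its TOML parser.

```python
def first_non_ascii(data: bytes) -> int:
    """Return the offset of the first non-ASCII byte in `data`, or -1 if there is none."""
    for offset, byte in enumerate(data):
        if byte > 0x7F:  # any byte with the high bit set is outside ASCII
            return offset
    return -1
```

An application that must stay ASCII-only could refuse any document for which this returns something other than -1, and use the offset in its error message.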
